Some Aspects of Selective Readout from Iconic Storage
M. T. Turvey*

Haskins Laboratories, New Haven



Two experiments are reported which examined the delayed partial
sampling of tachistoscopically presented displays. The first experiment
compared partial report by row and partial report by color for
displays of colored discs. A significant interaction was observed
between selection criterion and delay of report, and partial report
for both criteria was superior to whole report. These results for
the arrays of colored discs were replicated in the second part of
Experiment II, the first part of which also showed, however, that
when the displays consisted of colored letters the decline in accuracy
of partial report by row paralleled that of partial report by color.
The results are discussed in terms of the distinction between
preattentive and focal-attentive processes. More generally, the two
experiments are presented as supporting the hypothesis that performance
in iconic memory tasks is jointly determined by iconic storage
and short-term storage.

The notion of transient storage of visual material prior to categorization
has assumed a central role in recent constructivist (Neisser, 1967) and
information-processing discussions (Broadbent, 1971; Haber, 1969; Turvey, 1971)
of visual perception. Neisser (1967) has suggested the term "iconic" for this
kind of brief memory. The idea of an early buffer memory in perceptual systems
is, of course, not new; Broadbent (1958) and Pollack (1959), for example, had
pointed earlier to the theoretical need for such a concept in the analysis of
auditory perception. Moreover, Woodworth (1938) essentially presaged our
present conceptualizations of the iconic store:

The primary memory image has less of a definite quality than the
visual after-image and is distinguished by the fact that it does not
move with the eye, but remains stationary in the place where the
objects were exposed .... [T]hese after-images allow a few seconds
extra for the cerebral response to the data supplied by the retina.
(p. 692)

Of course, the degree to which such a stimulus representation is evident
varies with the perceptual task under examination. Brief, precategorical
storage is more likely to be observed in tasks where an overload of items is
presented for a limited duration, or where several items are presented
simultaneously on a number of different channels, or where the relevant response
categories are delayed (see Posner, 1963).

The procedure most favored for isolating and examining iconic storage is
the delayed partial-sampling paradigm introduced by Sperling (1960) and
Averbach and Coriell (1961). Essentially, the paradigm consists of displaying
tachistoscopically a number of items, usually letters or digits, in excess of
the memory span and following this display after a brief interval by an
instruction to report a part of the display. The interesting feature of this paradigm
is that this selective instruction, provided that it is given within
milliseconds after the display, may give a measure of item availability superior to
that obtained in a noninstructed case where S reports as many items as possible.
The instructed-noninstructed difference permits the inference of a
large-capacity store; the precipitous reduction of this difference with delay of
instruction permits the inference of rapid decay.

In view of current theorizing on the interaction between memory systems,
it is probably advisable to adopt the same attitude toward the delayed
partial-sampling, or iconic memory (IM), paradigm that we have adopted toward the
short-term memory (STM) distractor and probe paradigms. In short, it is argued,
contrary to earlier positions, that data obtained from STM tasks are never pure
indicants of the hypothesized short-term storage (STS) mechanism for categorized
material. The present view is that the probability of recalling an item in a
STM task is determined by the presence of the item in STS, or by its presence
in long-term storage (LTS), or by both (Atkinson and Shiffrin, 1968; Waugh and
Norman, 1965). Thus, by the same token, accuracy in reporting an item in an IM
task is determined by the presence of a representation of that item in iconic
storage (IS), or by a representation in STS, or by both.

This view of the IM task has been expressed explicitly by Averbach and
Coriell (1961), who argued that performance in the delayed partial-sampling
paradigm is the result of two different types of performance on the part of S. One
is a nonselective readout, independent of the appearance of the instruction cue;
the other is a selective readout, which occurs only subsequent to the decoding
of the instruction. Nonselective readout is suggested by the fact that
performance never appears to approach zero in delayed partial-sampling experiments;
instead it asymptotes at the level of noninstructed, or whole, report. Therefore,
we have to assume that S begins to categorize material and enter it into STS as
soon as possible, at least before the instruction cue. On occurrence of the cue,
some of the designated material may have been processed already; just how much
depends on the size of the display and the overlap between preselected and cued
items. In any event, S's cued report in an IM task can be based, in part, on
STS, where STS is viewed as consisting of both an abstract visual code and a
name code (see Coltheart, 1972).

The need for emphasizing that data obtained from IM tasks are not neces- 
sarily pure indicators of IS will become apparent in the two experiments reported 
here which took as their departure point the experiments of Clark (1969). 



EXPERIMENT I 



Clark (1969) investigated IM for three-by-five matrices of intermixed
colored discs using the selection criteria of location and color. Instructions
to report by color asked S to designate the locations in the matrix occupied by,
say, red discs. Instructions to report by location, on the other hand, required
that S specify the colors of the discs which occurred in the five locations of,
say, the bottom row. For both means of accessing IM, partial report was
significantly superior to whole report, i.e., noninstructed report. But of special
interest was Clark's finding that while the accuracy of partial report by
location declined with delay of the instruction cue, accuracy of partial report
by color did not. Experiment I, which sought to verify this observation of
Clark's, compared performance with the two selection criteria in a single
within-Ss design; in Clark's investigation, report by location and report by color
were examined in separate experiments with different Ss.

Method 

Subjects. The Ss were four undergraduates at the University of Connecticut
who participated in the experiment as a course requirement.

Stimulus materials and apparatus. Discs were outlined on sheets of red,
green, and yellow plastic. These were then cut out and placed onto a white
background for photographing. Forty-eight slides were made, each with twelve
colored discs, four each of red, green, and yellow, arranged in three rows of
four. Each of the forty-eight three-by-four arrays of colored discs was
constructed by assigning, at random, a colored disc to a location by the procedure
of selection without replacement from the set of twelve colored discs.
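
The construction rule just described can be sketched in a few lines of Python. This is only an illustration of sampling without replacement from a fixed pool of twelve discs; the function name `make_disc_array` and the seeded generator are assumptions introduced here and are not part of the original procedure, which prepared photographic slides by hand.

```python
import random

COLORS = ("red", "green", "yellow")
N_ROWS, N_COLS = 3, 4

def make_disc_array(rng):
    """Return a 3 x 4 array with four discs of each color in random locations."""
    # The pool is the set of twelve colored discs: four of each color.
    pool = [color for color in COLORS for _ in range(4)]
    # Shuffling the pool and dealing it out location by location is
    # equivalent to selecting without replacement for each location.
    rng.shuffle(pool)
    return [pool[r * N_COLS:(r + 1) * N_COLS] for r in range(N_ROWS)]

rng = random.Random(0)  # fixed seed (hypothetical) only to make the sketch repeatable
slides = [make_disc_array(rng) for _ in range(48)]
```

Because the pool is fixed, every array necessarily contains exactly four discs of each color; only their locations vary from slide to slide.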

A Lafayette T-2K Constant Illumination Projecting Tachistoscope was used
to project the slides onto a viewing screen at a distance of 50 cm from S. The
field, so viewed, subtended a visual angle of 6.0 deg vertical by 8.5 deg
horizontal. At this viewing distance, the diameter of each disc subtended
1.6 deg, and the separation between discs was 0.8 deg within a row and 1.2 deg
within a column. One channel constantly illuminated at 8 ft-L a pre- and
post-exposure fixation field at the center of which was a faint, but discernible,
cross. The slides were exposed for 80 msec at a luminance of 20 ft-L. In the
partial-report conditions the slide display was followed by one of the following
tones: 2,000 Hz, 600 Hz, or 200 Hz, signalling, respectively, top, middle, or
bottom row, or red, green, or yellow. Four tone delays of 0, 100, 300, and
1000 msec were used. These delays were measured from display offset. The
exposure duration, tone delay, and tone duration were controlled by three Hunter
timers.
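
For readers who want the physical sizes implied by these visual angles, the standard relation between angle and extent at a known viewing distance is size = 2 d tan(angle / 2). The sketch below applies it to the 50-cm viewing distance given in the text. The function name is an assumption, and the resulting centimeter values are derived here rather than reported in the paper.

```python
import math

VIEWING_DISTANCE_CM = 50.0  # distance from S to the screen, as stated in the text

def size_from_visual_angle(angle_deg, distance_cm=VIEWING_DISTANCE_CM):
    """Physical extent (cm) that subtends angle_deg at distance_cm."""
    return 2.0 * distance_cm * math.tan(math.radians(angle_deg) / 2.0)

disc_diameter_cm = size_from_visual_angle(1.6)   # roughly 1.4 cm
field_height_cm = size_from_visual_angle(6.0)    # roughly 5.2 cm
field_width_cm = size_from_visual_angle(8.5)     # roughly 7.4 cm
```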

Procedure. The same general procedure was followed for all trials of all
conditions: S was instructed to view the cross in the fixation field until it
appeared in focus, at which point S pressed a key to trigger the display of a
slide. Following the display S recorded his response on a response grid, using
a separate response grid for each trial. In the partial-report condition a
tone indicator occurred at a predetermined interval after termination of the
display, cuing S to report items by row or color. In the whole-report condition
no tone occurred; on termination of the display S attempted to report as many
items as possible. The responses of S on each trial were scored for the number
of discs reported in their correct positions. In partial report, the average
proportion of discs correctly reported was taken as an estimate of the proportion
of the whole display available to S, i.e., an estimate of the proportion of all
the locations in the matrix for which S had correct color information.
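
The scoring logic of partial report can be made concrete with a short sketch. It assumes four cued items per trial (one row, or one color class) and a twelve-item display, as in the present arrays. The function names and the example scores are hypothetical and are not data from the experiment.

```python
def proportion_available(correct_counts, items_cued):
    """Mean proportion of the cued items reported correctly across trials."""
    return sum(correct_counts) / (len(correct_counts) * items_cued)

def estimated_items_available(correct_counts, items_cued, display_size):
    """Scale the cued-subset proportion up to the whole display."""
    return proportion_available(correct_counts, items_cued) * display_size

# Hypothetical scores (NOT data from the paper): number correct out of the
# four cued discs on each of six partial-report trials.
example_scores = [3, 4, 2, 3, 4, 2]
estimate = estimated_items_available(example_scores, items_cued=4, display_size=12)
```

With these made-up scores the cued proportion is 0.75, which scales to an estimate of nine of the twelve locations. The inference rests on the assumption that S cannot know in advance which subset will be cued, so that the cued subset is a fair sample of the whole display.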

Three days of practice on the IM task preceded the experiment proper. The
purpose of the three practice days was to acquaint S with the general procedure,
to provide practice in discriminating between the three tones, and, most
important, to insure familiarity with the cuing functions of the tones. On the
average, practice sessions lasted 1 to 1-1/2 hours and included a total of 100
trials divided into two blocks of 20 trials of whole report and four blocks of
20 trials of partial report (two blocks of report by row and two of report by
color). At the end of each trial on Days 1 and 2, S was given feedback on the
accuracy of his performance. The experiment proper was conducted on Days 4 and
5. Two Ss on Day 4 were given twelve trials of whole report followed by
ninety-six trials of partial report by row, twenty-four trials at each of the four
intervals randomly interspersed in the series of ninety-six, and finally twelve
more trials of whole report. The other two Ss received the same number, and
order, of whole- and partial-report trials but with partial report by color.
Day 5 repeated this procedure with Ss receiving the partial-report condition
that they had not received on Day 4.
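
The session structure described above (twelve whole-report trials, ninety-six partial-report trials with twenty-four at each delay in random order, then twelve more whole-report trials) can be sketched as follows. The function names and the seed are assumptions; the paper does not say how the random ordering was actually produced.

```python
import random

DELAYS_MSEC = (0, 100, 300, 1000)
TRIALS_PER_DELAY = 24
WHOLE_REPORT_TRIALS = 12

def partial_report_delays(rng):
    """Ninety-six cue delays, 24 at each interval, randomly interspersed."""
    delays = [d for d in DELAYS_MSEC for _ in range(TRIALS_PER_DELAY)]
    rng.shuffle(delays)
    return delays

def session_plan(rng):
    """One experimental session as a list of (trial type, cue delay) pairs."""
    whole = [("whole", None)] * WHOLE_REPORT_TRIALS
    partial = [("partial", d) for d in partial_report_delays(rng)]
    return whole + partial + whole

plan = session_plan(random.Random(1))  # seed is arbitrary, for repeatability only
```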

Before each session on both days Ss were given several sequences of the
three tones until they could reliably identify which was the high, which was
the middle, and which was the low. And prior to each block of partial-report
trials S was given practice in identifying the cue function of the tones relevant
to that partial-report condition.

Throughout the training and experiment days, the tone-row and tone-color 
combinations were counterbalanced so that each tone specified each row and each 
color an equal number of times at each delay interval. Within blocks of 
partial-report trials the tones were randomized. 
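
One simple way to satisfy a counterbalancing constraint of this kind is to rotate the tone-to-target assignment cyclically, so that across the set of mappings each tone signals each row (or each color) exactly once. The sketch below illustrates that idea only. The cyclic (Latin-square) scheme and the function name are assumptions, since the paper does not state how the counterbalancing was implemented.

```python
TONES_HZ = (2000, 600, 200)
ROWS = ("top", "middle", "bottom")
COLORS = ("red", "green", "yellow")

def rotated_mappings(tones, targets):
    """All cyclic rotations of the tone -> target assignment."""
    n = len(tones)
    return [
        {tones[i]: targets[(i + shift) % n] for i in range(n)}
        for shift in range(n)
    ]

row_mappings = rotated_mappings(TONES_HZ, ROWS)
color_mappings = rotated_mappings(TONES_HZ, COLORS)
```

Using each of the three mappings equally often at each delay would give every tone-row and tone-color pairing the same number of trials.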

Results and Discussion 

Essentially, the logic behind this type of experiment is that if a selection
criterion is efficient the whole-partial difference should be significant. We
may suppose that efficient selection criteria, so defined, reflect the character
of the iconic store. Either they identify those properties of stimulation that
are available or they point to those properties which can be most rapidly
ascertained at this stage in the flow of visual information in the nervous
system.

The results of the experiment are shown in Figure 1. A repeated-measures
analysis of variance was conducted on the proportion of items reported in the
whole-report and the 0-sec delay partial-report conditions. This difference was
significant, F [...], p [...]. An analysis was also performed on the
total number of discs correctly reported by each S at each indicator delay for
both kinds of selection. This analysis showed that the main effect of delay,
F [...], and the interaction between selection criterion and
indicator delay, F(3,9) = 5.33, p < .025, were significant but selection criterion
as a main effect was not, F(1,3) = 1.06, p > .03.
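
The degrees of freedom attached to these F ratios follow directly from the design: four Ss, two selection criteria, and four indicator delays, all varied within Ss. A minimal sketch of the bookkeeping is given below. The function name is an assumption, and the sketch computes degrees of freedom only, not the F ratios themselves.

```python
def within_subject_df(n_subjects, *factor_levels):
    """(effect df, error df) for an effect in a fully within-Ss design.

    The effect df is the product of (levels - 1) over the factors involved;
    the error term is the effect-by-subjects interaction.
    """
    effect_df = 1
    for levels in factor_levels:
        effect_df *= levels - 1
    return effect_df, effect_df * (n_subjects - 1)

N_SUBJECTS, N_CRITERIA, N_DELAYS = 4, 2, 4

df_criterion = within_subject_df(N_SUBJECTS, N_CRITERIA)               # (1, 3)
df_delay = within_subject_df(N_SUBJECTS, N_DELAYS)                     # (3, 9)
df_interaction = within_subject_df(N_SUBJECTS, N_CRITERIA, N_DELAYS)   # (3, 9)
```

These values agree with the (1,3) and (3,9) degrees of freedom quoted for the criterion and interaction effects.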

In the main these results confirm the major findings of Clark: color and
location were both efficient selection criteria and the temporal course of report
by color differed from that of report by location. The present experiment,
however, did not show that report by location was superior to report by color,
suggesting that the superiority of location evident in Clark's data was probably
due to between-Ss differences. In addition, Clark reported that report by color
was invariant with indicator delay, but a separate analysis of the
color-selection data of the present experiment revealed a significant decline in
performance across delay intervals, F(3,9) = 4.45, p < .05.

Figure 1: Partial report by color and by row of disc
arrays as a function of indicator delay in
Experiment I.

There are at least three hypotheses which speak to the question of why
different IM functions were obtained with report by location and by color. One
hypothesis proposes that the decay rate of the iconic material differed for the
two modes of selection. This possibility seems unlikely given that the same
set of stimulus features, color and location, was required for report in both
selection modes. Moreover, this hypothesis contradicts the view that the decay
parameter of IS is a structural property of the system and, in keeping with the
distinction drawn by Atkinson and Shiffrin (1968), is not modifiable by control
processes available to S (cf. Doost and Turvey, 1971).

Another hypothesis views the different IM functions as the result of the
different types of uncertainties operating in the two conditions. In report by
color, item uncertainty is zero but spatial uncertainty is high; in report by
row, on the other hand, item uncertainty is higher than spatial uncertainty. In
view of the slight but nonsignificant tendency for color selection to be better
than row selection it might have to be argued, on this hypothesis, that item
uncertainty is more detrimental than spatial uncertainty in the present IM task.
However, it is far from clear how these differences in item and spatial
uncertainty would produce the obtained interaction between selection criterion and
indicator delay.

A third hypothesis, the one favored here, derives from two distinctions:
the distinction drawn by Neisser (1967) between preattentive and focal-attentive
processes and the distinction made earlier between IM and IS.

Preattentive processes are viewed as relatively crude preliminary operations
that segregate the optical array into units which are then acted upon by focal
attention, a process which makes extensive contact with LTS and is essential for
pattern recognition. In the present experiment, report by location requested
S to name the colors of the discs in each of a specified set of locations.
Report by color, on the other hand, requested S to specify the locations occupied
in the matrix by discs of a designated color. The requirement to name the disc
colors in report by location suggests the involvement of focal attention; by
contrast, performance in the report-by-color condition could have been mediated
primarily, if not solely, by the products of preattentive mechanisms. All that
is needed in the report-by-color condition is that the elements in the matrix be
segregated into patterns by color, which is not an unreasonable demand since
preattentive processes are analogous to the modes of perceptual organization
described by Gestalt psychology (Neisser, 1967). The organizing principle
invoked here is that of grouping according to similarity. Given the resolution
of the array into color patterns, S may now simply enter these patterns into STS
without awaiting the partial-report instruction. Indeed, only two patterns need
be entered (which, of course, is well within the limits of STS capacity), since
the disc locations of the remaining color class could be remembered by
elimination. [An interpretation of this kind was proposed by Keele and Chase (1967)
for the failure of Eriksen and Steffy (1964) to obtain a decline in partial-report
accuracy with indicator delay when the stimuli were six-item binary arrays.]
On this interpretation the advantage of partial report by color over whole report
lies in the reduction of information requiring rehearsal, i.e., central
processing capacity (Posner, 1966), or in the reduction of output interference.

The idea, therefore, is that performance in the present IM task with color 
as the selection criterion was determined primarily by STS. Of course entering 
patterns into STS was an option open to S in report by location; in that case, 
however, encoding the stimulus in this way would not be especially useful since 
reporting the colors of a row would require a relatively complicated decoding 
operation. In short, the difference between the two selection modes, on this
view, is that report by color involved only preattentive processes and was
relatively more dependent on STS than on IS, while report by location required
focal-attentive operations and had to depend more on the less persistent IS
representation.

At first blush it might seem that focal attention or figural synthesis
(Neisser, 1967) is the sine qua non for determining the establishment of more
persistent modes of representation, i.e., for effecting the translation from IS
into forms suitable for storage in STS and LTS. There are, however, several
reasons for doubting this.

Most notable is the fact that one can remember certain gross characteristics
of a visual event some considerable time after the event has occurred and
without having known more detailed or categorical properties of the event. For
example, one can remember that something occurred, without ever having known
what that something actually was, or one can remember that a particular location
was occupied, without having known the identity of the occupant. In delayed
partial-sampling experiments Ss may have a rough idea of how many items were
presented in the array without knowing what they were (Eriksen and Rohrbaugh,
1970). In dichotic listening experiments Ss can report, after a relatively
lengthy delay, the voice quality of an unattended message but not know the
semantic content of the message (e.g., Cherry, 1953). In brief, the products
of preattentive processes, like those of focal-attentive processes, can enjoy the
privileges of post-iconic, categorical stores.

EXPERIMENT II 

If the interpretation of the selection criterion by delay interaction of
Experiment I is basically correct, then it should be possible to eliminate this
interaction by requiring focal-attentive processes in both report by location
and report by color. In the second experiment, report by location and by color
were examined, with letters as the to-be-reported items. Report by location
asked: what were the letters in these locations? Report by color asked:
where were the letters of this color and what were they? In the
selection-by-color condition of this second experiment, each location occupied by an object
of a particular color would have to be examined in order to synthesize/identify
the object's name. Thus, the selection-by-color condition of the second
experiment differed from that of the first in that focal attention was needed
to produce the required response. Entering the color patterns into STS would
give little advantage since the process of naming would have to make use of the
content of IS. In brief, IM performance with letter arrays for both selection
criteria in Experiment II should be determined primarily by IS.

Experiment II was conducted in the same manner as Experiment I and
included a replication of Experiment I for purposes of comparison. The
experiment was conducted in two parts.

Method 

Subjects. The Ss were three University of Connecticut graduate students,
who volunteered their services, and the author. The same four Ss participated
in both parts of the experiment.

Stimulus materials and apparatus. Using the same method and materials
used for making the disc slides, forty-eight new slides were made, each with
twelve colored letters, four each of red, green, and yellow, arranged in three
rows of four. The twelve letters were C, F, H, J, L, N, P, S, T, U, X, and Z,
and no letter was repeated within a slide. Because of the large number of
slides required for complete counterbalancing of letter, color, and location,
the following procedure was used. For any given slide twelve letters were
assigned at random to the twelve locations by the method of selection without
replacement from the set of twelve letters. A color was then randomly assigned
to a letter in a location by a similar procedure. The letters in the display
subtended 1.6 deg vertical and on average 1.2 deg horizontal. The average
separation between columns was 0.8 deg and between rows it was 1.2 deg. The
apparatus, tones, delay intervals, and all other viewing measurements were the
same as those of Experiment I. The three-by-four disc slides used in the second
part of the experiment were the same as those described previously in Experiment I.


Procedure. The two parts of the experiment were conducted over seven days
with the first three days as training days. On Days 4 and 5 the experiment
proper was conducted with the letter slides (Part I). Days 6 and 7 were used
to run the Experiment I replication with the disc slides (Part II). The
procedure used over these seven days followed the pattern outlined in Experiment I,
except that all training was done on the letter slides.

Results and Discussion 

Each S on each trial of both Parts I and II was scored for the number of
items reported in their correct location. In the partial-report conditions,
the average percentage of items correctly reported in a cued row, or of a cued
color, was taken as the estimate of the total number of locations in the array
for which S had correct letter (Part I) or color (Part II) information. Figure 2
shows the relation between these percentages and indicator delay for both
selection criteria; also included are the whole-report means for both Parts I and
II. Inspection of Figure 2 lends support to the hypothesis under test: the
relation between the two selection criteria in Part I differed fundamentally
from that in Part II. A repeated-measures analysis of variance performed on the
Part I data showed that report by location was superior to report by color,
F(1,3) = 54.74, p < .01, and that the main effect of indicator delay was
significant, F(3,9) = 64.10, p < .001; however, there was no significant interaction
between the selection-criteria functions (F < 1). Quite to the contrary was the
outcome of the same kind of analysis of the Part II data, which showed that the
selection criterion by delay interaction was significant, F(3,9) = 15.14, p < .001.
Also, although the difference between report by color and report by location was
not significant, F(1,3) = 7.15, .05 < p < .10, inspection of Figure 2 suggests
that, if anything, color was superior to location as a selection criterion,
contrary to Part I. And of course, as inspection of Figure 2 further suggests,
the main effect of indicator delay was significant, F(3,9) = 16.52, p < .001.
All of these results for the disc arrays replicate those of Experiment I.

Figure 2: Partial report by color and by row of letter
(Part I) and disc (Part II) arrays as a function
of indicator delay in Experiment II.

Two further separate analyses were conducted. One compared partial report
of letters by color at 0-msec delay with the whole report for the letter arrays
and found a significant difference, F(1,3) = 15.93, p < .05. The other showed
that accuracy of report by color in Part II declined significantly with delay
of indicator, F(3,9) = 4.21, p < .05.

In sum, these analyses lend support to the interpretation that disc
selection by color was conducted differently from disc selection by location or
letter selection by either criterion. The hypothesis advanced above argues
that performance in the latter three conditions required focal attention and,
therefore, was more dependent on IS, while performance in the former condition
did not require focal attention and relied for the most part on a representation
in STS.

In Part I of the present experiment selection by location was superior to
selection by color, and the magnitude of the difference between the two was
relatively constant across all delays of indicator. Both conditions involved
high item uncertainty but spatial uncertainty was pronounced only in the
selection-by-color condition, and this might have accounted for the difference in
performance between the two conditions (see Bennett, 1971, for a discussion of
item and spatial uncertainty in IM tasks). We may compare the situation in
Part I to that existing in Part II. There, item uncertainty was limited to
selection by location, and selection by color involved only spatial uncertainty.
Although the difference was not significant, selection by color tended to be
superior to selection by location, an observation corroborated by Experiment I.
What this implies is that spatial uncertainty per se could not have accounted
for the inferior selection-by-color performance in Part I of the present
experiment. More probably the inferior performance was due to spatial uncertainty
coupled with item uncertainty and the resulting extra or at least different
demands on processing that this coupling brought about. We might suppose, as
we did above, that the presence of item uncertainty in the selection-by-color
condition of Part I (as opposed to the selection-by-color condition with the
disc arrays in Part II) prohibited S from taking advantage of the patterns
entered into STS following preattentive segregation. Indeed, entering the
segregated color patterns into STS may well be prohibitive under the task demands of
identifying letters.

GENERAL DISCUSSION 

The present paper has suggested that both IS and STS have to be considered
for an account of IM performance. Recently, in a series of papers by Holding
(1970, 1971) and Dick (1971), the existence, or at least utility, of IS has been
questioned and the implication has been made that IM performance is based solely
on STS. Holding (1970) argued that in IM tasks where the selection criterion is
row, if S could predict which row would be cued and if he then fixated on the
expected row, the estimate of available items derived from partial report would
be inflated. In support of his argument Holding demonstrated that performance
in an IM task varied systematically with the predictability of the cue. His
conclusion therefore was that the concept of selection from IS was not needed to
explain the difference between partial report and whole report. However,
while Holding's experiments suggest caution in the construction of cue sequences
when the selection criterion is row, his explanation of the partial-whole
difference in terms of fixation strategies cannot apply to IM situations where
nonspatial selection criteria such as color, size, brightness, or shape are
used (e.g., Turvey and Kravetz, 1970; von Wright, 1968, 1970).

Evidence that Ss are using different strategies, perhaps relying differently
on IS and STS under conditions of spatial and nonspatial selection criteria, is
strongly implied by Table I. Table I shows the number of times Ss in Part I of
Experiment II responded with zero, one, two, three, or four letters in their
correct locations as a function of cue delay for both selection by row and by
color.¹ (The data on one S are not included in Table I because her overall
performance was so high that most of her responses were in the four category for
both selection criteria.) At all delays with row as the selection criterion,
there were more perfect responses in the four category than there were responses
with three items correct. Moreover, the frequency of three items correct was
consistently less at all delays than the frequency of two items correct. Thus
the distributions at each delay were relatively normal with the exception of the
four category. By comparison the distributions for selection by color show no
irregularity in the four category, and at all delays the distributions are
normal. Obviously Ss are not behaving in the same way in the two selection
criterion conditions; selection by row does seem to take advantage of fixational
or attentional biases of the kind suggested by Holding. Moreover, we may conclude
from inspection of Table I that partial report by row made greater use of STS
than partial report by color.

TABLE I

Frequencies of Response Categories for
Report by Row and by Colorᵃ

Delay                       Category
(msec)               0     1     2     3     4

   0     Row      [...]  23  [...]  30
         Color    [...]  20    24    15  [...]

 100     Row      [...]  15  [...]  21
         Color     11  [...]  21    14  [...]

 300     Row         6    15    20    14    17
         Color      12    20    25    11     4

1000     Row        17    22    18     4    11
         Color      18    32    15     6     1

ᵃThe sums of the frequencies are not identical
across the rows of the table because Ss sometimes
misinterpreted the tone indicator. This
resulted in the occasional loss of a trial.

¹This analysis was suggested by Joel Kleinberg.
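
The pattern described for Table I can be checked numerically on the rows of the table that are fully legible. The sketch below uses those four rows only. Their assignment to the 300- and 1000-msec delays is an assumption about the table layout, and the function names are hypothetical.

```python
CATEGORIES = (0, 1, 2, 3, 4)  # number of letters reported in correct locations

# Frequencies for categories 0-4, copied from the complete rows of Table I.
# The delay labels are an ASSUMPTION about how the flattened table reads.
TABLE_ROWS = {
    ("row", 300): (6, 15, 20, 14, 17),
    ("color", 300): (12, 20, 25, 11, 4),
    ("row", 1000): (17, 22, 18, 4, 11),
    ("color", 1000): (18, 32, 15, 6, 1),
}

def mean_correct(freqs):
    """Mean number of letters correct per trial for one frequency distribution."""
    return sum(k * f for k, f in zip(CATEGORIES, freqs)) / sum(freqs)

def four_exceeds_three(freqs):
    """True when perfect reports outnumber reports with three correct."""
    return freqs[4] > freqs[3]

means = {key: mean_correct(freqs) for key, freqs in TABLE_ROWS.items()}
irregular = {key: four_exceeds_three(freqs) for key, freqs in TABLE_ROWS.items()}
```

On these rows the four category exceeds the three category for report by row but not for report by color, which is the irregularity the text attributes to fixational or attentional biases.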

In a similar vein, Holding (1970) and Dick (1971) have pointed out that
because more items have to be output in whole report than in partial report,
there is more output interference in the former condition than in the latter,
resulting in an artificially inflated difference between the two. As we have
noted, Holding's view is that IS does not exist or at least cannot be accessed,
i.e., it is not useful to S. Thus partial-report performance characterizes STS
just as whole-report performance does, and any difference between the two
performances represents nothing more than some artifact of measurement. Against
this argument, however, are two kinds of evidence which show that partial report
and whole report do not reflect completely identical memorial representations.
In the first place there are the data of Averbach and Sperling (1961) and Keele
and Chase (1967) which show that luminance conditions affect partial report but
do not affect whole report. In the second place Scharf and Lefton (1970) have
shown that an after-coming pattern mask which does not impair whole-report
performance at delays greater than 50 msec (cf. Sperling, 1963) impairs
partial-report performance even when delayed 250 msec. In sum, there is good reason to
believe that the delayed partial sampling of a visually presented array of items
depends on information available in a storage medium other than STS.
Voice-Timing Perception in Spanish Word-Initial Stops* 

Arthur S. Abramson and Leigh Lisker 
Haskins Laboratories, New Haven 



In the general phonetic literature it is commonly stated that languages 
use such phonetic features as voicing, aspiration, glottalization, implosion, 
"tensity," etc., to distinguish consonants produced at the same supraglottal 
place of articulation. In previous work we have argued (Lisker and Abramson, 
1971) and to some extent demonstrated (Lisker et al., 1969; Sawashima et al., 
1970) that some of these features are entirely or largely explainable in terms 
of laryngeal control. Our view has been that the timing of events at the 
glottis relative to supraglottal articulation provides a simple description of 
how this laryngeal control is manifested (Abramson and Lisker, 1970a). 1 In our 
earlier work on this subject (Lisker and Abramson, 1964), we measured voice 
onset time (VOT) in word-initial stop consonants across a number of languages. 
VOT, the interval between the release of the stop and the onset of phonation as 
shown in spectrograms, was the simplest single measure we could find in the 
acoustic signal of the timing of laryngeal adjustments. The dimension proved 
efficacious in acoustically differentiating stop consonants in most of the 
languages with two, and even three, phonological categories at each place of 
articulation. 2 

In the present study we wanted to determine the nature of the relations 
between VOT as varied in synthetic speech and the labeling and discrimination 
behavior of Spanish speakers whose two stop categories differ phonetically 
from the two of English. This is a continuation of studies reported earlier 
(Abramson and Lisker, 1965, 1970b; Lisker and Abramson, 1970). 



This is a revised version of a paper given at the 83rd Meeting of the Acous- 
tical Society of America, Buffalo, N. Y., 18-21 April 1972. 

Also University of Connecticut, Storrs. 

Also University of Pennsylvania, Philadelphia. 

Recent electromyographic work on laryngeal muscles lends support to this view 
for English consonants (Hirose and Gay, in press). A helpful schematic 
picture of the temporal relations is given by P. Ladefoged (1971:10). 

A fourth category examined, voiced aspiration, clearly involves glottal 
adjustments but not of the kind that is discernible on the VOT dimension. 
Our current electromyographic work with Hajime Hirose, however, does show 
that this category is distinguished from the others, at least in part, by 
temporal factors in the contraction of intrinsic muscles of the larynx. 






To control VOT in measured increments we used the Haskins Laboratories 
formant synthesizer. Our basic pattern was three steady-state formants for a 
vowel of the type [a]. Labial, apical, and velar stop releases were simulated 
by means of appropriate formant transitions. We synthesized thirty-seven VOT 
variants ranging from 150 msec before the release to 150 msec after it. For 
voicing before the release (voicing lead), we used only low-frequency harmonics 
of the buzz source. For voice onset after release (voicing lag), the interval 
between release and onset of the periodic source was excited by hiss alone, 
with suppression of the first formant to simulate the well-known "first-formant 
cutback" (cf., Liberman et al., 1958). Three conditions of VOT for synthetic 
labial stops are shown in Figure 1. The thirty-seven VOT variants thus 
generated were recorded on tape in eight random orders for each place of 
articulation and played to a total of twelve native speakers of Latin American 
Spanish who, using Spanish orthography, were to identify the stimuli with their 
stop phonemes. Instructions were prepared in Spanish and given to the subjects 
to help insure that they would apply Spanish categories to the stimuli. 

The twelve subjects used in the identification experiments were not 
dialectally homogeneous, coming from Puerto Rico and some six nations of Central 
and South America. To the best of our knowledge, there is not enough information 
about phonetic variation in the Spanish dialects of Latin America with regard to 
the voicing feature to help explain individual differences in our data. 3 For our 
part, we had too small a sampling of subjects from each of the areas represented 
to make any dialectological statements based on the results of our experiments. 
The subjects were all more or less bilingual in Spanish and English, having 
studied English for some years. At the time of starting the experiment, most of 
them had been in the United States 4 no more than one year, two of them less than 
six months, and one for five years. Although they varied considerably in English 



A search of the literature, with the much-appreciated bibliographical help of 
Gardiner H. London of the University of Connecticut, yields no statement 
describing instability of voicing in word-initial /b d g/ or, for that matter, 
unexpected aspiration in /p t k/. This is true of general works (e.g., 
Lope Blanch, 1968) and descriptions of varieties of Spanish represented in our 
sampling of test subjects: Argentinean (Malmberg, 1950; Vidal de Battini, 
1964), Colombian (Flórez, 1964), Cuban (Lopez, 1971), Mexican (Lope Blanch, 
1964; Harris, 1969), and Puerto Rican (Navarro, 1948). These authors call 
attention only to dialectal differences in the positional and lexical 
distribution of stop and fricative allophones. Harris (1969:41) affirms, at least for 
the cultivated speech of Mexico City, that voicing lead is "clearly audible 
under good acoustical conditions." Lope Blanch (1964:88) comments that in the 
Yucatan Spanish of Mexico, stops are glottalized because of the Mayan 
substratum. It is possible, of course, that certain recent trends in pronunciation 
have not been documented. Malcah Yaeger of the University of Pennsylvania has 
observed (personal communication) devoicing of /b d g/ in some regional and 
social dialects. 

Our single Puerto Rican subject had been on the mainland close to fifteen 
months. He may well have had much more contact with English than the others, 
but no effects on his Spanish were discerned. 
Figure 1: Three conditions of voice onset time in synthetic labial stops. 
From top to bottom, spectrograms of voicing lead, slight lag, and long lag. 
proficiency, all but one of them showed marked Spanish phonic and syntactic 
interference in their English. The one exception was an excellent bilingual 
with a barely detectable Spanish accent and seemingly native English grammar. 
To help insure against the probability of English interference in the Spanish 
of our subjects, we chose them with the aid of Spanish language consultants 
at Queens College of the City University of New York and the University of 
Connecticut, where the tests were run. Our screening of the subjects, done in 
hiring interviews by our consultants, was perhaps too superficial to rule out 
entirely the possibility of any phonic interference from their exposure to 
English, but for the very recently arrived individuals, at least, the likeli- 
hood was small. 

Figure 2 gives the results of these tests. On the abscissa, negative 
numbers are assigned to voicing lead and positive numbers to lag, while the 
moment of stop release is labeled zero. The stimuli varied in 10-msec steps, 
except for the range of -10 to +50, where we made them in 5-msec steps. For 
each place of articulation, the identification curves are functions of VOT 
values. The synthetic patterns clearly provided enough cues for two good per- 
ceptual categories at each place of articulation. The 50 percent crossover 
points are given in the table below with the comparable English points, repro- 
duced from our earlier work (Lisker and Abramson, 1970; Fig. 2) for comparison. 



Spanish and English Category Boundaries 
in Perception of Voice Timing 
(msec) 

            Spanish    English 

Labial        +14        +25 
Apical        +22        +35 
Velar         +24        +42 



The Spanish perceptual crossovers have lower VOT values than the English 
for all places of articulation. This is consistent with the fact that English 
initial /p t k/ show considerable voicing lag, i.e., aspiration, in stressed 
syllables, while Spanish /p t k/ show little or no voicing lag and are 
unaspirated; furthermore, Spanish /b d g/ are characterized by voicing lead, i.e., 
voicing during the occlusion, whereas their English counterparts seem normally 
to show VOT values of about zero (Lisker and Abramson, 1964:392, 394). 
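
The 50 percent crossover points discussed above can be estimated from an identification curve by interpolating between the two stimuli that bracket 50 percent. The following is a minimal sketch of that idea, not the procedure used in the study; the function name `crossover_50` and the sample data are hypothetical and are not taken from Figure 2.

```python
def crossover_50(vot_ms, pct_voiceless):
    """Estimate the VOT (msec) at which identification crosses 50 percent.

    vot_ms:        stimulus VOT values in ascending order.
    pct_voiceless: percent of responses in the voiceless (/p t k/) category.
    Returns None if the curve never reaches 50 percent.
    """
    points = list(zip(vot_ms, pct_voiceless))
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if y0 == 50:
            return float(x0)
        if (y0 - 50) * (y1 - 50) < 0:
            # Linear interpolation between the two stimuli that bracket 50%.
            return x0 + (50 - y0) * (x1 - x0) / (y1 - y0)
    if points and points[-1][1] == 50:
        return float(points[-1][0])
    return None


# Hypothetical identification data in 5-msec steps (assumed for illustration).
vot = [0, 5, 10, 15, 20, 25]
pct = [2, 10, 30, 60, 90, 98]
print(crossover_50(vot, pct))
```

Linear interpolation is only one possible choice; fitting a smooth psychometric function to the whole curve would be an equally reasonable alternative.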



The 1970 study also includes VOT identification functions for the three-way 
voicing distinction of Thai. Perceptual data derived from tests with somewhat 
similar stimuli have been presented for Dutch (Slis and Cohen, 1969). 
Figure 2: Perceptual identification of VOT variants by native speakers 
of Spanish. Pooled data. 
Our interest in the effects of linguistic experience upon the 
discriminability of variants along a phonologically relevant continuum 
[...] investigated across languages [...]. The testing of discrimination 
of VOT variants in English and Thai (Abramson and Lisker, 1970b) has 
[...] earlier. [...] covering the span from -150 to +150 msec in 
[...]. We presented these variants in triads as an oddity task. In 
each triad, two [...] identical and one was different. The task was to 
[...] first, second, or third position. The triads 
were made by pairing stimuli at 2-, 3-, and 4-step intervals along the 
continuum, [...] differences of 20, 30, and 40 msec. Several permutations 
[...] randomizations of the test series were presented to some of 
the native speakers of Spanish who had taken the identification tests. Very 
few [...] were able to stay with the experiment over a long enough 
period [...] to accumulate a large number of data points for each comparison;6 
[...] presented them for individual 
subjects and only for two places of articulation. 
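
The construction of oddity triads described above can be sketched as follows. This is an illustration under stated assumptions, not the authors' test-generation procedure: the stimulus spacing of 10 msec is inferred from the statement that 2-, 3-, and 4-step pairings give differences of 20, 30, and 40 msec, the function name `oddity_triads` is hypothetical, and the sketch simply enumerates all six arrangements of each pair.

```python
def oddity_triads(vot_values, step):
    """Build oddity triads from stimuli `step` positions apart on the continuum.

    Each triad contains two identical stimuli and one odd stimulus.
    Returns a list of (triad, odd_position) tuples, odd_position in {0, 1, 2}.
    """
    triads = []
    for i in range(len(vot_values) - step):
        a, b = vot_values[i], vot_values[i + step]
        # Either member of the pair may serve as the odd stimulus.
        for same, odd in ((a, b), (b, a)):
            for pos in range(3):
                triad = [same, same, same]
                triad[pos] = odd
                triads.append((tuple(triad), pos))
    return triads


# Assumed continuum: -150 to +150 msec in 10-msec steps (31 stimuli).
continuum = list(range(-150, 151, 10))
two_step = oddity_triads(continuum, 2)
print(len(two_step), two_step[0])
```

In practice the triads would then be shuffled into several test orders before presentation.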

[...] discrimination curves for all three 
levels of difficulty for Subject MP. Each point on a curve is placed equidistant 
between the two [...] discriminated. The line placed perpendicular 
[...] perceptual crossover point in 
[...] precisely at the discrimination 
[...], indicating considerable correlation with the 
phonological boundary. Note, however, the additional one or two small peaks for 
higher values of voicing lag. 

At the bottom of Figure 3 we see the discrimination data for LQ. At his 
50 percent crossover point, shown by the vertical line at -15 msec, there is a 
4-step discrimination peak of 77 percent. There are, however, two other large 
discrimination peaks at +20 msec and +70 msec, and the one at +20 msec is 
95 percent, considerably higher than the peak at the phoneme boundary. Both 
subjects, then, especially LQ, seem to show effects other than the linguistic. 

The discrimination of VOT in velar stops is shown for Subject EL at the top 
of Figure 4. [...] 27 msec is under a discrimination 
peak [...]. In addition, his voicing lag discrimination is 
[...] peak [...] +70 msec. EL, by 
the way, [...] described earlier as an excellent bilingual with 
[...] his English. VOT measurements of his initial 
/g/ and /k/ in recordings of Spanish words7 yield a boundary that corresponds 



6 

Strange and Halwes (1971) have shown, using our VOT stimuli, that the use of 
confidence ratings in the oddity task can save much testing time. By the time 
they had shown this, we were too far along in our Spanish experiments to 
modify our discrimination procedures. Had we used confidence ratings, we 
might have been able to salvage another two or three subjects. An important 
discussion of discrimination procedures in experiments on the perception of 
speech sounds is found in Pisoni (1971). 

7 We failed to obtain voice recordings of the other subjects. We made a special 
Figure 3: Spanish labial discrimination. 

Figure 4: Spanish velar discrimination. Velar discrimination functions for two individual subjects. 



with his identification crossover point. That is, his /g/ and /k/ ranges abut 
at +30 msec. He deviates from previously examined Spanish speakers (Lisker and 
Abramson, 1964:402) in producing instances of /g/, indeed 56 percent of the 
time, with no voicing lead. His background makes it hard to rule out English 
interference. He had his elementary and secondary schooling at an American 
school in Lima, Peru, where he studied English for thirteen years. In addition, 
he had spent four and a half years in the United States when the experiment 
began. 

The velar data for JP, shown at the bottom of Figure 4, are somewhat more 
complicated. His velar identification data do not reveal a single 50 percent 
crossover point; rather they show a zone of ambiguity between /g/ and /k/ from 
-8 msec to +20 msec. We have placed two vertical lines on the time axis to 
show this span. A discrimination peak reaching 97 percent straddles the right 
end of the crossover zone, while a smaller peak straddles the left end; there 
is a third peak around +90 msec. 

The perceptual efficacy of VOT as a sufficient cue for distinguishing the 
voiced and voiceless stops of Spanish seems established. The possible 
information-bearing value of other particular acoustic features sometimes associated 
with voicing distinctions, e.g., pitch (Haggard et al., 1970; Fujimura, 1971) 
and F1 transitions (Cooper et al., 1952:600; Stevens and Klatt, 1971), we 
believe, is also ascribable to the relative timing of events at the larynx 
and the supraglottal place of articulation. The question of the influence of 
linguistic categories on the performance of discrimination tasks, at least as 
far as the present study is concerned, is more complicated. The presence of 
a phonological boundary certainly has an effect, more with some subjects than 
others, but there are also discrimination peaks remote from the phonological 
boundary and indeed always in the lag end of the continuum where spectral 
variation is somewhat more complex. That is, even though in Spanish and in 
many other languages the presence or absence of voicing lead is an important 
cue to a phonological category, stop variants with voicing lag are just easier 
to discriminate on some psychoacoustic basis. Our earlier work with English 
and Thai showed similar effects, but they are much more striking here. Of 
course, there is also the possibility that the psychoacoustic effect is 
combined with a linguistic one in the sense that large values of lag may sound so 
aspirated to the Spanish ear that they are considered foreign by the listener 
and therefore well discriminated from others judged to be more Spanish-like. 
We have no theoretical rationale for predicting how naive listeners might 
process a range of speech sounds which lies well outside the norms of their native 
language, whether they treat these sounds in effect as belonging to a single 
category of "foreign" speech sound or as some sort of nonspeech continuum.8 



effort to record EL's speech because of his unusual background. His English 
stops, recorded in a separate session, look native by and large but seem to 
show a slight Spanish interference. 

8 

For either way of processing these sounds, we do not know what kind of dis- 
crimination function the subjects would show. See the discussion in Mattingly 
et al. (1971:152-154). 
Some Effects of Oral Anesthesia upon Speech: An Electromyographic Investigation* 

Gloria Jones Borden+ 

Haskins Laboratories, New Haven 



INTRODUCTION 

It has been a long-observed fact that when one comes from the dentist's 
office there is often a disturbance of clearly articulated speech until the 
effect of the anesthesia has disappeared. It is understandable, therefore, 
that investigators interested in afferent control of speech should block the 
sensory nerves of normal speakers with anesthesia in order to study the 
relationship between feedback from the oral area and articulation of speech. 
Presumably all feedback channels are used to develop language: audition, taction, 
and proprioception. The question is whether skilled speakers need depend upon 
these feedback possibilities during ongoing speech and to what degree or under 
what circumstances each channel may play a role. Is learned speech centrally 
patterned, with little or no need under normal circumstances for peripheral 
control? A series of studies during the 1950s and '60s dealt with this subject. 
It was found that bilateral mandibular and infraorbital injections of 
anesthesia increased the number of judged errors in articulation of adult speakers 
(McCroskey, 1958; Ringel and Steer, 1963). The speech distortions were found 
to be subtle and were most evident in the production of fricatives and 
affricates (Scott, 1970; Borden, 1971; Gammon, Smith, Daniloff, and Kim, 1971). It 
was assumed by the investigators that the speech effect was the result of 
decreased oral sensation as a result of blocking sensory feedback from the 
tongue via the lingual nerve. A phonetic analysis of the speech effect under 
anesthesia revealed two factors which prompted further investigation: first was 
the variability of effect among speakers, with some subjects unaffected by the 
nerve block, although oral sensation was reported to be lost, and the second 
factor was the predominance of articulatory distortions among the sibilants and 
affricates, especially /s/ in consonant clusters, in those subjects who were 
affected (Borden, 1971). It was decided to study electromyographically the 
contraction of some of the muscles thought to be implicated in lingual movement 
under conditions of nerve block and under normal conditions. 

FIRST ELECTROMYOGRAPHIC STUDY 

Two separate electromyographic (EMG) experiments were conducted in an 
attempt to find out what happens to certain suprahyoid muscles as subjects speak 
under conditions of trigeminal nerve block. Since the nerve block seemed to 
produce an /s/ effect, muscles which are thought to contribute to tongue 
elevation were reviewed (Van Riper and Irwin, 1958; Hirano and Smith, 1967; Zemlin, 
1968). The muscles which were accessible, clearly identifiable, and of 
interest for this study were the genioglossus, geniohyoid, mylohyoid, and the 
anterior belly of the digastric muscles. The orbicularis oris was included as a 
reference (Figure 1). 

Method 

The monopolar electrodes used were DISA concentric needle electrodes with 
a diameter of .45 mm. Needle placement was made through the cutaneous tissue 
under the chin to the depth required. Correct placement was checked by observ- 
ing the oscilloscope while protruding the tongue for genioglossal activity, 
saying "ta" for geniohyoid activity, lowering the mandible for digastric 
activity, and saying "ka" for mylohyoid activity. Correct placement was checked 
periodically throughout each run. 

The subject for the first experiment was a normal adult speaker. Two runs 
were produced, the first without nerve block, and the second with bilateral 
mandibular blocks. A total of 7.5 cc of 2% Xylocaine was injected by a 
dentist, 3 cc in each side and an additional 1.5 cc on one side. The technique 
was similar to that used by McCroskey (1958), the model for all previous studies. 
A partial run was recorded with a medial nasopalatine block of 1 cc and an 
anterior palatine block of 2 cc added, but this part of the study was not analyzed, 
as the speech effects were not noticeably different from the run with the 
bilateral mandibular blocks alone. It seems that loss of sensation from the 
anterior portion of the hard palate and the alveolar ridge adds very little to 
the speech effect evidenced with the mandibular blocks. 

For the EMG studies, material was selected from the utterances used in our 
previous work. Eleven utterances in sentence form, using the format "It could 
be the ___," were used to permit the necessary rapid connected speech. 
Each utterance was represented twice in a randomized list of twenty-two 
utterances. There were ten such lists, each individually randomized. Each 
utterance was spoken twenty times during the course of one run. The utterances were 
as follows: 



    It could be the snowballs splashing. 
    It could be the cat's whiskers. 
    It could be the fixed sweater. 
    It could be the school blocks. 
    It could be the thirsty wasp. 
    It could be the sleeping taxi. 
    It could be the spider string. 
    It could be the squirrel nest. 
    It could be the rooster scratch. 
    It could be the spring grapes. 
    It could be the stove smell. 



The 220 utterances for each run were printed and mounted on large cards which 
were flipped as the subject read them, with equal stress attempted on each of 
the final two words. 
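
The list design described above (eleven utterances, each appearing twice in every list of twenty-two, ten individually randomized lists, 220 utterances per run) can be sketched as follows. This is only an illustration of the counting; the function name `make_lists` and the use of a seeded pseudo-random shuffle are assumptions, since the text does not say how the original lists were randomized.

```python
import random

UTTERANCES = [
    "It could be the snowballs splashing.",
    "It could be the cat's whiskers.",
    "It could be the fixed sweater.",
    "It could be the school blocks.",
    "It could be the thirsty wasp.",
    "It could be the sleeping taxi.",
    "It could be the spider string.",
    "It could be the squirrel nest.",
    "It could be the rooster scratch.",
    "It could be the spring grapes.",
    "It could be the stove smell.",
]


def make_lists(utterances, n_lists=10, reps_per_list=2, seed=0):
    """Return n_lists reading lists, each shuffled independently."""
    rng = random.Random(seed)  # the seed is a hypothetical choice for repeatability
    lists = []
    for _ in range(n_lists):
        one_list = list(utterances) * reps_per_list
        rng.shuffle(one_list)
        lists.append(one_list)
    return lists


run = make_lists(UTTERANCES)
print(len(run), len(run[0]), sum(len(one_list) for one_list in run))
```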
A 16-channel magnetic tape was produced, recording the electrical output 
of the muscles, which were monopolar recordings; that is, the difference was 
recorded between the active tissue of the muscles and the inactive tissue of 
the earlobe. Some of the channels were used for audio signals, such as the 
utterances produced by the subject and the comments for record-keeping produced 
by the experimenters. Each utterance was numbered by a pulse code which was 
laid down on the tape and eventually on the computer output. 

The output of the channels was put onto paper tape both at the time of the 
run and later for locating and inspecting the individual tokens. Each 
utterance was represented twenty times during each run, and a single point in time, 
a line-up point, was selected so that all of the tokens of a single type could 
be averaged by computer for each electrode. The line-up point was chosen at a 
point of particular interest and marked on the simultaneous recording of the 
subject's audio recording. 

Each tape was subjected to five computer programs to check that the code 
pulses were in order, to set the gains of the playback amplifiers at levels 
appropriate for the analog-to-digital converter, to make control tapes of the 
line-up points and distances from point zero for each utterance, to set each 
EMG channel at the optimum level, and finally to average the data on the control 
tapes. 

The paper output of this process is a list of numbers for each channel, 
indicating the averaged value of each electrode in microvolts every 5 msec. 
The three runs were hand plotted. 
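
The averaging step described above, in which all tokens of one utterance type are aligned on a common line-up point and averaged sample by sample, can be sketched as follows. This is a minimal illustration, not the original computer programs; the function name `average_tokens`, its parameters, and the sample values are hypothetical, and the only detail taken from the text is that values are reported every 5 msec.

```python
def average_tokens(tokens, lineup_indices, before, after):
    """Average EMG tokens sample by sample after aligning them on a line-up point.

    tokens:         list of sequences of values (microvolts), one per token,
                    all sampled at the same interval (5 msec in the text).
    lineup_indices: index of the line-up sample within each token.
    before, after:  number of samples kept on each side of the line-up point.
    """
    if not tokens:
        raise ValueError("no tokens to average")
    window = before + after + 1
    sums = [0.0] * window
    for samples, zero in zip(tokens, lineup_indices):
        start = zero - before
        if start < 0 or zero + after >= len(samples):
            raise ValueError("token too short for the requested window")
        for k in range(window):
            sums[k] += samples[start + k]
    return [total / len(tokens) for total in sums]


# Two made-up tokens whose line-up points fall at different sample indices.
tokens = [[0, 10, 20, 30, 40], [5, 5, 15, 25, 35, 45]]
averaged = average_tokens(tokens, lineup_indices=[2, 3], before=1, after=1)
times_msec = [(k - 1) * 5 for k in range(len(averaged))]  # relative to line-up
print(list(zip(times_msec, averaged)))
```

Aligning on a point of articulatory interest before averaging keeps activity near that point from being smeared by token-to-token differences in speaking rate.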

Results and Discussion 

Inspection of the data reveals that the muscular activity recorded during 
speech under normal conditions remained high during the nerve-block condition 
[...] muscles. After the nerve-block injections, it was 
observed by the experimenters that the activity on the oscilloscope of the 
mylohyoid muscle and the anterior belly of the digastric muscle dropped dramatically 
to a state of relative inactivity. The electrodes were checked and found to be 
in place, but as long as the anesthesia was effective those muscles were in 
[...]. [...] nerve block revealed the 
typical mandibular block effect of distorted sibilants, the /s/ clusters being 
most prominently affected. Compare the graph of the two affected muscles 
during the utterance "sleeping taxi" under normal conditions 
(Figure 2) with the graph of the same electrode placements during nerve block. 
All eleven utterances showed the same drop in activity for the mylohyoid muscle 
and the anterior belly of the digastric during anesthesia. 

A closer look at the anatomy at the injection area showed us that we should 
not have been surprised. The mandibular injection which has traditionally been 
used for these studies deposits half of the solution in the area of the lingual 
nerve, then moves on to deposit the rest of the solution in the area of the 
inferior alveolar nerve. It happens that just before the inferior alveolar 
nerve enters the mandibular foramen into the mandibular canal, it gives off the 
nerve fibers of what is known as the mylohyoid nerve, the only purely motor 
component of the inferior alveolar branch of the trigeminal nerve 
(Figure 3). The mylohyoid nerve is motor to the mylohyoid muscle and to the 
anterior belly of the digastric muscle, the two muscles which dropped in activity 
during the nerve-block condition. 
The next consideration was whether the inactivity of either of these 
muscles could have contributed to the noted speech deterioration. If the speech 
effect is primarily due to sensory loss, then loss of feedback from the tongue 
tip region would probably be responsible. If it is due to motor loss, however, 
then the anterior belly of the digastric muscle and the mylohyoid muscle are 
probably responsible. 

The normal function of the anterior belly of the digastric muscle is to 
open the jaw. EMG data on this muscle, obtained by recording muscle activity 
during simple "CVp" utterances, showed no action for /i/ and /u/ and a large 
peak for /a/ (Harris, 1971). Since there was no perceptible speech effect of 
the nerve block upon vowels, and since the action of the anterior belly would 
not reasonably be expected to affect the apical gestures which deteriorated 
under nerve block, it seems unlikely that its motor loss could have caused the 
speech effects observed. It may be that other mouth-openers compensate. 

The normal function of the mylohyoid muscle was found by both Harris (1971) 
and Smith (1970) to be highest for the production of /k/. Its contraction 
seems to lift the body of the tongue. In the more complex utterances of the 
present study, it can be seen that the mylohyoid muscle peaked normally in 
preparation for the /s/ consonant clusters and for the velars (Figure 4). Notice 
the activity at the beginning of "spring," "spider," and "string," and at the 
end of "grapes" and "string." Observe the drop in activity of the mylohyoid 
muscle during the nerve-block condition. The peaks of activity under normal 
speaking conditions, then, coincided with the speech distortions produced under 
the nerve-block condition, with the exception of the velars. 

The nerve block did not distort the velars sufficiently to be perceived as 
a distortion. The production of /k/ remained intact, as had been reported in all 
previous nerve-block experiments. The explanation may lie in the comparatively 
gross production of /k/ and the fact that we, as listeners, accept as /k/ a less 
precise gesture than we do as /s/. 

It seems, therefore, that the effected "paralysis" of the mylohyoid muscle 
might reasonably be related to the speech effect, since, for this subject, the 
mylohyoid muscle appears to be important in lifting and steadying the body of 
the tongue for consonant clusters, especially those with /s/ (Table 1). This 
subject produces /s/ with the tongue tip down, making it imperative that the 
body of the tongue be raised to produce the friction. Deprived of motor ability 
in the mylohyoid and deprived of lingual sensation, the /s/ clusters were 
distorted. It is impossible to conclude which of these factors, if not both, is 
responsible for the distorted speech, but it cannot be assumed, as it has in 
previous studies, that the effect is due to loss of sensory feedback. 

In summary, the clear conclusion of this first EMG experiment was that a 
motor component existed in what was previously assumed to be a sensory depriva- 
tion. The motor loss was evident in two of the suprahyoid muscles, the mylo- 
hyoid muscle and the anterior belly of the digastric muscle. One of these 
muscles, the mylohyoid, is normally active for this subject for /s/ clusters 
and velars. Since this subject produced /s/ with a high dorsum, it is reasonable 
to assume that the motor loss in the mylohyoid muscle may have contributed to the 
speech deterioration during anesthesia. 






TABLE 1: Peak values in microvolts for mylohyoid muscle in first 
EMG experiment during nerve-block (NB) and normal conditions. 

spring grapes 
  Normal      345      155      285 
  NB           30       35       20 
  msec      (-225)    (125)    (715) 

cat's whiskers 
  Normal      315      355      380      370 
  NB           35       40       40       20 
  msec      (-800)   (-505)   (-140)    (200) 

thirsty wasp 
  Normal      185      310 
  NB           30       35 
  msec      (-855)    ([...]) 

stove smell 
  Normal      335      355 
  NB           30       50 
  msec      (-215)    (325) 

snowballs splashing 
  Normal      415      340      430 
  NB           30       55       25 
  msec      (-140)    (500)    (900) 
  (/ng/ not plotted) 

rooster scratch 
  Normal      175      200      310      370 
  NB           30       40       40       20 
  msec      (-775)   (-440)   (-125)    (325) 

fixed sweater 
  Normal      485      210      140 
  NB           45       45       15 
  msec        (45)    (325)    (585) 

school blocks 
  Normal      380      400 
  NB           50       30 
  msec      (-145)    (640) 

squirrel nest 
  Normal      215      150 
  NB           50       25 
  msec      (-175)    (635) 

spider string 
  Normal      355      300      210 
  NB           35       40       25 
  msec      (-210)    (365)    (790) 

sleeping taxi 
  Normal      425      265      355 
  NB           30       40       40 
  msec      (-155)    (300)    (635) 



SECOND ELECTROMYOGRAPHIC STUDY 

The purpose of the second EMG study was to verify the result of the first 
study, which was that mylohyoid motor loss accompanied the distorted speech 
during the nerve-block condition, and also to study further the changes in 
muscle activity by comparing the electrical potential in normal speech with 
the electrical potential during nerve block. 

Method 



It was necessary to use a second subject for this experiment. The material 
consisted of thirty utterances in the frame "the ___." They were randomized 
into four lists repeated alternately four times, making sixteen lists of 
[...]. Fifteen of the utterances were chosen from the Scott 
[...] attempt to observe the muscle changes in the distorted speech 
which might explain the phonetic changes which she had transcribed. The other 
fifteen utterances were selected from the sentences in the first study and from 
[...] done on the same morning, the 
[...] conducted under normal conditions, the second under blocked 
conditions. 

[...] mylohyoid muscle and in both anterior 
bellies of the digastric muscle. [...] mylohyoid activity, opening the 
mouth with jaw effort for [...]. 

[...] the oral region of the subject [...] into 
[...] nerve blocks [...]. The injections used in this study were 
[...]. 

[...] two-point discrimination was made, and when the 
[...] fifty-five-item oral discrimination test of ten 
[...]. Confusion of shape occurred three times [...]. 

[...] extensive injections were administered was to enable the 
[...]. The intent was to produce a purely sensory block without any motor effects 

[...] of these runs 
were analyzed in much the same way as the first experiment. There were some 
TABLE 2: Injections of anesthesia administered in the second EMG study. 

Cranial     Branch                Amount of         Location of          Area of 
Nerve                             Solution          Injection            Sensation 

V (mand.)   Inf. Alveol. n.       1.5 cc ea. side   pterygomand.         mand. alv. ridge, 
            Lingual n.                              triangle             lip, gums; 
                                                                         ant. 2/3 tongue 

V (mand.)   Long Buccal n.        .5 cc ea. side    1st molar            buccal 

V (max.)    Infraorbital          .5 cc ea. side    infraorbital         upper lip, 
            Ant. Sup. Alv.                          foramen              alv. ridge, 
            Middle Sup. Alv.                                             ant. teeth 

V (max.)    Nasopalatine n.       .5 cc midline     post. to central     ant. 1/3 palate 
                                                    incisors 

V (max.)    Post. Sup. Alv. n.    .5 cc ea. side    2nd molar            molars 

V (max.)    Greater Palatine n.   .5 cc ea. side    palate 3rd mol.      post. 2/3 palate 

refinements in the computer programs. A concise description of the analysis
procedure is reported by Port (1971).

Results and Discussion 

As a result of the first EMG experiment, the investigators were particularly
interested in this second study in the activity of the mylohyoid muscle.
Since there were bilateral placements of electrodes in both the mylohyoid muscle
and the anterior belly of the digastric muscles, the investigators had an
opportunity to study the activity on both sides of these muscles. During the
normal run, before the injections of anesthesia, the mylohyoid and the anterior
bellies showed activity similar to the first subject. The anterior belly peaked
for mouth opening and the mylohyoid for velar gestures and somewhat for the /s/
clusters.

During the condition of nerve block, however, there was a decrease in activity
in both muscles on the right side. The right anterior belly of the digastric
was in all cases significantly less active than normal after anesthesia. The
right mylohyoid was consistently less active than normal for velar gestures, but
for the /s/ clusters it was sometimes less active and sometimes more active than
normal. The decreased activity on the right side in this experiment was not as
pronounced as it had been in the first EMG study, indicating that the attempt on
the part of the dentist to avoid the motor mylohyoid nerve was partially
successful. The limited effect on the right side was presumed by the
investigators to be the result of some infiltration of the anesthetic in the
area of the mylohyoid nerve.
In contrast with the decreased activity observed on the right side of the
mylohyoid and anterior belly of the digastric muscles, the left side of these
muscles was usually more active than normal while the anesthesia was in effect.
Figure 5 demonstrates the asymmetry of effect. The right peak in each of the
four graphs represents the labial closing for /p/ in "duckpond." It can be seen
that the right side of both muscles was quite active during normal speech but
dropped in activity during speech with nerve block. The left electrode placement
in the mylohyoid was in a slightly less active field than the right side. That
is, there were fewer motor units firing near the electrodes on the left side.
The left-side placement of the electrode into the anterior belly of the
digastric was in a particularly inactive field. The problem of electrode
placement into a more or less active field of the muscle is less important in
this study than in many, because our interest is in comparing the activity
recorded at a single site under two different conditions, normal and nerve
block. Relative values, therefore, are more important than absolute values. A
final look at Figure 5 shows both muscles on the left side to be more active
during nerve block than they were normally.

We have no explanation of these results except to assume that the anesthetic 
solution had a motor effect on one side of the subject and that there was some 
reorganization of motor function on the opposite side to compensate for the motor 
loss. Typically, bilateral injections of anesthesia result in some asymmetry of 
effect. In the perceptual study we sometimes had to reinject a subject on one 
side, due to insufficient loss of sensation. The subject for the first EMG study 
required an additional 1.5 cc of xylocaine on one side to equalize the desen- 
sitivity. It is reasonable to assume, therefore, that there would be the same 
possibilities for asymmetry of motor effect, depending upon the amount of infil- 
tration of the anesthetic solution into the fibers of the motor mylohyoid nerve. 

The most prominent result of this study, therefore, was that despite
considerably less anesthesia and an attempt to avoid the mylohyoid nerve, there
was a unilateral drop in mylohyoid and anterior belly of digastric activity
during anesthesia, although the other, apparently unaffected side demonstrated
efforts at compensation by showing more than normal activity.

A second interesting result of this EMG experiment was that the subject's 
articulation appeared to be clear under nerve block. There were no discernible 
phonetic distortions. The speech sounded as acceptable under the nerve-block 
condition as under the normal condition. The utterances were louder under nerve 
block and produced with what might be described as overarticulation. 

This variability of nerve-block effect among subjects was observed during
the perceptual part of this series of studies. It is unclear why there was no
speech effect. It might be a difference in muscle use, as this subject produces
/s/ with tip of the tongue raised and might not rely on mylohyoid muscle
activity as much as the first subject, who produces /s/ with dorsum of the
tongue raised, keeping the tip down. Another explanation for no speech effect
might be a difference in anesthesia, either in amount or in technique of
injection. It is customary in these studies to inject anesthesia until the
subject reports loss of sensation. In the mandibular block, loss of sensation
is reported immediately when the lingual nerve has been hit directly, as it was
in the case of this second subject. Only 1.5 cc of xylocaine solution was
injected into each side, whereas 4.5 cc in each side was necessary before the
subject of the first experiment lost sensation. The solution presumably
anesthetized the mylohyoid nerve of the first subject, as we have indicated
mylohyoid muscle and anterior belly of the digastric muscle inactivity. In this
subject there was less anesthesia needed to effect loss of sensation, and the
solution apparently did not penetrate the mylohyoid motor nerve fibers on one
side.

The third result of the second EMG study was a fairly consistent pattern
of muscle reorganization under nerve block. Table 3 summarizes the muscle
activity in general for each utterance during nerve block as it compares to its
own activity normally. The orbicularis oris was usually either the same as
normal or more active than normal. The genioglossus tended to be the same.
Inexplicably, the geniohyoid was less active during nerve block. The rest of
the muscles follow a reasonable pattern of adjustment. The right side of the
mylohyoid muscle and the anterior digastric lost activity during nerve block,
as previously discussed. The scatter plot of the differences in peak values of
the right anterior digastric during the two conditions is clear. It was always
higher normally than during anesthesia (Figure 6). The right mylohyoid showed
the same decreased activity for the normally active velar gestures, but for the
high front gestures such as /t/ or /s/ clusters, there was increased activity
during nerve block (Figure 7).

TABLE 3: Relative muscle activity during the nerve-block condition for each
utterance.

        More Active   Less Active   Same As   Different
        Than Normal   Than Normal   Normal    Than Normal

OO          11             5          13           1
GG           1             8          21
GH                        29           1
SH          22                         7           1
MHR          4            14           7           5
MHL         24                         6
ADR                       30
ADL         15                        15

Shifting our attention to the left mylohyoid and anterior digastric, we
see that again the anterior belly clearly increases activity during nerve block,
perhaps as compensation for the less active right side (Figure 8). The left
mylohyoid, however, is somewhat more complex. It, too, was more active during
nerve block. Notice that for the less active front consonant gestures, there
was less increase in activity during nerve block than for the already normally
active velars (Figure 9). When the right side of the mylohyoid dropped for the
velars, the left side soared. Finally, the sternohyoid was interesting as it
was consistently more active during nerve block than under normal conditions and
might reasonably be expected to compensate for the inactivity of the anterior
digastric. The anterior belly of the digastric opens the jaw, as does the
sternohyoid (Figure 10). In summary, the muscles do seem to be behaving
[...] subject, remains unclear [...] seemed to be the case with the first
[...], when it exists, related to a loss of [...] sensation, as has
traditionally been suggested? [...] possibly be related to some reorganization
of the unaffected [...] an attempt to compensate for the motor loss [...]
demanding rapidity and precision of gesture such as /s/ and /r/ in consonant
clusters?

Laryngeal Control in Vocal Attack: An Electromyographic Study



Hajime Hirose* and Thomas Gay**
Haskins Laboratories, New Haven



SUMMARY 

Multichannel EMG recordings were obtained from the intrinsic laryngeal
muscles of four American English speakers for three different types of vocal
attack: breathy, soft, and hard. The data were processed by a digital
computer to obtain an average indication of overall muscle activity.

The results indicate that the three different types of vocal attack are
characterized by coordinated actions of the abductor and adductor muscles of
the larynx, and further, that these muscles work in reciprocal fashion for
each type of attack.

INTRODUCTION

The mechanism of laryngeal control for different types of vocal attack¹
has long been a subject of interest in the fields of laryngology and
experimental phonetics. Three types of vocal attacks are generally recognized:
(1) breathy or aspirate, (2) soft or simultaneous, and (3) hard or glottal.

Various experimental techniques have been used to investigate these types
of vocal attack: high-speed cinematography (Moore, 1938; Werner-Kukuk and
von Leden, 1970), aerodynamic study (Isshiki and von Leden, 1964; Koike, 1967;
Koike et al., 1967), and electromyography (EMG) of the intrinsic laryngeal
muscles (Gay et al., in press; Hirano, 1971; Koike, 1967; Sawashima et al.,
1958). Among these, electromyography is particularly useful, as it provides
the most direct information on the actions of the individual muscles responsible
for vocal attack.

Most of the previous studies in laryngeal physiology generally support the
classical division of the intrinsic laryngeal muscles into three functional
groups: abductor, adductor, and tensor. However, there still are many
unanswered questions concerning the function of individual laryngeal muscles in
different modes of laryngeal adjustment.



*On leave from Faculty of Medicine, University of Tokyo.

**Also University of Connecticut Health Center, Farmington.

¹The term "attack" usually refers to vocal initiation in singing; if we use
this classification to refer to speech utterances consisting of /C + V/
sequences, breathy attack should be equivalent to the utterance initiated
with /h/, soft attack to that with voiced consonant, and hard attack to that
with glottal stop.




The first EMG study of vocal attack was attempted by Faaborg-Andersen
(1957), who compared the activity of the vocalis and cricothyroid in the
production of /hop/, /bop/, and /op/, representing breathy, soft, and hard
attack, respectively.² He stated that the time interval between the start of
the increase in activity in the two muscles and the onset of the tone (Δt) was
greater for hard attack than for either the breathy or soft attack in both
muscles.

Koike (1967) later examined the EMG activity of the vocalis and the
cricothyroid in his extensive study of vocal attack and claimed that Δt was
largest for hard attack but that values were variable for soft and breathy
attacks. He also claimed that the amplitude of the pre-phonatory activity of
these two muscles seemed to serve as a more reliable index for differentiating
the type of vocal attack than Δt values.

Hirano (1971) repeated the first two studies using trained singers who
were asked to begin phonation on a signal. He was unable to distinguish Δt
values for the three vocal attack conditions. He suggested, rather, that the
mode of activity of the adductors (the lateral cricoarytenoid and vocalis in
this case) of the larynx, particularly during the pre-phonatory period, is the
most essential factor for differentiating the type of vocal attack.

These previous EMG reports dealt solely with the adductor and the tensor 
groups of the larynx, and no attempt was made to clarify the participation of 
the abductor muscle, the posterior cricoarytenoid, in vocal attack. Further- 
more, most previous studies were based on the observation of limited numbers of 
raw EMG traces. It would seem reasonable, then, that a detailed, systematic 
EMG study of all the intrinsic muscles of the larynx is needed to provide a 
complete description of the muscle control mechanism of vocal attack. 

The purpose of the present study was to investigate systematically the 
actions of all the intrinsic laryngeal muscles in different types of vocal 
attack. Particular attention was directed to comparing the temporal character- 
istics of the EMG activity patterns for the abductor and adductor muscles. 

PROCEDURES

Subjects 

The subjects were four adults, three male and one female, all native 
speakers of American English. The female subject (AP) was a trained singer. 

For each subject, an attempt was made to record from the five intrinsic 
muscles simultaneously. However, this goal was reached for only two of the 
four (LJR and LL). Unsatisfactory recordings were obtained for the posterior 
cricoarytenoid and the cricothyroid of one subject (TG), and the posterior 
cricoarytenoid, the interarytenoid, and the vocalis of another (AP).



²These test utterances can be transcribed phonetically as [hop], [bop], and
[?op], respectively.

Recording and Processing of Data



Conventional hooked-wire electrodes consisting of a pair of insulated
platinum-iridium alloy wires with a short hook at the tip were used in the
present experiment (Hirano and Ohala, 1969). The wires were threaded in a
hypodermic needle and inserted into the muscle with the needle. After the tips
of the wires were located in the muscle, the needle was withdrawn, leaving the
wires in place.

The electrodes were inserted percutaneously through the skin of the
anterior neck into the lateral cricoarytenoid (LCA), the vocalis (VOC), and the
cricothyroid (CT), while the insertions into the posterior cricoarytenoid (PCA)
and the interarytenoid (INT) were made perorally by indirect laryngoscopy under
topical anesthesia. A specially designed curved probe was used for the peroral
insertions.

The basic data-processing procedures followed in the present experiment
were to collect EMG data for a number of tokens of each type of vocal attack
and, using a digital computer, average the integrated EMG signals at each
electrode position. EMG data were recorded on a multichannel tape recorder
together with the acoustic signal and digital code pulse (octal format). This
pulse was used for identifying each utterance for the computer during
processing. In the present experiment, the line-up point for averaging was the
onset of voicing of each utterance. A more detailed description of both the
data-recording and data-processing techniques can be found elsewhere (Gay et
al., in press; Hirose, 1971; Hirose et al., 1971; Port, 1971).
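
The averaging step described above can be pictured with a short sketch. It is an illustration under stated assumptions, not the laboratory's program: the function names `integrate` and `average_tokens`, the moving-average form of integration, and the window length are all hypothetical. Each token is rectified and smoothed, cut to a fixed window around its voice-onset sample (the line-up point), and the windows are then averaged point by point.

```python
from statistics import fmean


def integrate(signal, window):
    """Full-wave rectify, then smooth with a trailing moving average.

    The moving average stands in for EMG integration; the actual
    integration method is not specified in the text (assumption).
    """
    rectified = [abs(x) for x in signal]
    out = []
    running = 0.0
    for i, x in enumerate(rectified):
        running += x
        if i >= window:
            running -= rectified[i - window]  # drop the sample leaving the window
        out.append(running / min(i + 1, window))
    return out


def average_tokens(tokens, onsets, pre, post, window=5):
    """Average integrated EMG across tokens, lined up at voice onset.

    tokens: one list of samples per repetition, for a single electrode
    onsets: sample index of voice onset in each token (the line-up point)
    pre, post: number of samples kept before and after the line-up point
    """
    aligned = []
    for signal, onset in zip(tokens, onsets):
        if onset - pre < 0 or onset + post > len(signal):
            continue  # token does not cover the requested window; skip it
        envelope = integrate(signal, window)
        aligned.append(envelope[onset - pre:onset + post])
    if not aligned:
        raise ValueError("no token covers the requested window")
    # Point-by-point mean across the aligned tokens.
    return [fmean(column) for column in zip(*aligned)]
```

Lining the tokens up before averaging matters because activity that is time-locked to voice onset survives the mean, while activity at inconsistent times tends to be smoothed away.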

Experimental Conditions 

Isolated monosyllabic words /ha/, /ba/, and /?a/ were used to represent
breathy, soft, and hard attacks, respectively. The subjects were required to
repeat each test utterance sixteen times. Vocal intensity and frequency were
kept at normal levels.

RESULTS 

Figure 1 shows the averaged EMG curves of the five intrinsic laryngeal 
muscles for subject LJR. It is clearly demonstrated in this figure that the 
pattern of activity of the individual laryngeal muscles differs depending upon 
the type of vocal attack. 

In breathy attack, PCA stays active throughout the pre-phonatory period 
up to the point immediately before the onset of voicing (in this example, the 
activity starts to decrease approximately 150 msec before the onset of voicing). 
Its activity then decreases steeply and remains suppressed during the period of 
voicing. Conversely, the activity of the other four muscles appears to be 
suppressed during the pre-phonatory period and then increases steeply when PCA
activity begins to decrease, peaking at about the time of voice onset. 

In hard attack, PCA activity decreases well before the onset of voicing.
PCA then shows a transient increase in activity just before the onset of
voicing, after which it is suppressed again for the period of voicing. The
adductors, LCA in particular, show a very characteristic pattern of activity
for hard attack. LCA activity increases markedly long before (in this example
more than 700 msec prior to) the onset of voicing and stays high during the
pre-phonatory period. It then shows a steep fall immediately before the onset
of voicing, followed by a less pronounced rise for the voicing period. INT,
VOC, and CT also show activity during the pre-phonatory period followed by a
fall at approximately the onset of voicing.

In soft attack, PCA activity is suppressed throughout the pre-phonatory
period. The general pattern of activity of PCA in soft attack is similar to
that in hard attack, except that there is no temporary increase before the
onset of voicing. The activity of the adductors and CT increases gradually,
reaching a peak after the onset of voicing.

The pattern of activity of the individual laryngeal muscles examined in 
the other three subjects was essentially similar to that observed in the first 
subject. 

Figure 2 illustrates the averaged EMG curves of a second subject (LL) for 
three types of vocal attack. The temporal characteristics of PCA activity of 
the second subject are quite similar to those of the first subject with respect
to the following points: (1) in breathy attack, PCA stays active throughout 
the pre-phonatory period up to the moment immediately preceding the onset of 
voicing after which it shows a steep fall; (2) in soft and hard attack, PCA 
activity starts to decrease well before the onset of voicing; for hard attack 
there is a transient increase before the voice onset, while for soft attack, it 
stays suppressed throughout; (3) PCA activity is higher for the pre-phonatory 
period than for the period of voicing regardless of the difference in the type 
of vocal attack. 

When we compare the temporal characteristics of adductor activity of the 
second subject to those of the first subject, it is observed in both cases that 
adductor activity in breathy attack remains suppressed during the pre-phonatory 
period and then increases steeply for initiation of voicing. In hard attack, 
LCA shows a marked increase in activity during the pre-phonatory period in the 
second subject too, although the timing of the onset of the increase is somewhat 
later than that in the first subject. In soft attack, the adductors show a 
gradual increase in activity toward initiation of voicing. In the second 
subject, however, the increase starts earlier for soft attack than in the case 
of breathy attack. 

Figures 3 and 4 show the averaged EMG curves for subjects TG and AP,
respectively. In subject TG, all the adductors show more or less similar
temporal characteristics for each type of vocal attack. In hard attack, in
particular, both INT and VOC also showed considerable pre-phonatory activity
followed by a fall immediately after. There is a general tendency of gradual
increase in activity in soft attack. It is noted in breathy attack that there
is temporary suppression of activity preceding the steep rise for initiation of
voicing for all three muscles.

The temporary dip in activity just before the steep rise in breathy attack is
also found in subject AP, both in LCA and CT. The general pattern of LCA
activity of subject AP for each type of vocal attack is essentially similar to
that of subject TG, though the activity increases more steeply near the voice
onset, both for breathy and soft attacks.
DISCUSSION

[...] coordinated actions of the abductor and adductor muscles of the larynx
characterize each type of vocal attack.

In breathy attack, PCA shows a characteristic pattern during the pre-phonatory
period, where it stays active until just before the onset of voicing. [...]
high-speed cinematography (Werner-Kukuk and von Leden, 1970) and [...]
fiberoptic observation (Sawashima, 1968) that the glottis [...] production of
initial /h/. Presumably, the [...] activity [...] in conjunction with low
adductor activity [...] physiological correlate of the open glottis for initial
/h/.

[...] However, it shows a steeper increase near voice onset in breathy attack.
Hirano (1971) stated that the [...] temporary dip in activity preceding the
steep rise in [...] that there is a temporary dip in LCA activity in breathy
attack in subjects TG and AP (see [...]) [...] consistent for the others. The
dip may well be interpreted as a temporary suppression of adductor activity for
the [...] increasing slightly in order to [...] between singing and speech
[...] preparatory muscle tonus as well as of temporary [...]

[...] that the maximum value of INT activity is higher in breathy attack than
in the other two attacks.³ It is generally agreed that EMG activity grossly
represents the muscle [...] force or displacement, although either isometric or
isotonic conditions in a strict sense are hardly expected in reality. [...] INT
is considered to [...] adduction during speech. Since glottal width [...]
obviously larger in breathy attack, it should be reasonable to expect that in
order to accomplish the larger displacement [...] must necessarily be higher.
On the other hand, the activity of the LCA or the VOC is not always higher for
breathy attack than for soft attack. This would suggest that these two [...]
additional functions, such as supplying medial compression or tension to the
vocal folds.

[...] characterize hard attack is the temporal pattern of LCA [...] the
pre-phonatory period [...] compression or constriction of the glottis prior to
release. A steep fall in LCA activity accompanied by a brief [...]
physiological mechanism controlling abrupt glottal release after the period of
constriction. In subject TG, VOC and INT also appear to contribute to strong
constriction of the glottis during the pre-phonatory period.

It is characteristic in soft attack that PCA activity decreases gradually
toward the onset of voicing, while the adductors appear to show gradual increase



³Subject TG showed highest INT activity for hard attack during pre-phonatory
period. However, INT activity for voicing appears to be highest in breathy
attack.



in activity for initiation of voicing. In the present study, the test
utterance which was used for soft attack was initiated by /b/. It is
conceivable [...] vocal folds hardly start vibrating before articulatory
release even if they are adducted near to the midline. There may, however,
[...] depending on whether the utterance is initiated by a voiced consonant or
a vowel.

[...] on activity of the intrinsic laryngeal muscles [...] reported that PCA
and INT [...] (Hirose and Gay, in press). It was revealed that PCA shows marked
activity for the production of the voiceless consonant, while INT is
suppressed. Conversely, INT generally [...] production of vowels and voiced
consonants, while PCA is reciprocally suppressed. It was further revealed that
the other [...] when compared with INT. Namely, LCA and VOC showed increasing
activity for [...] while appearing inactive for consonant production regardless
of the voiced vs. voiceless distinction, at least in that particular context.

[...] different pattern of activity [...] LJR and LL. In subject TG, on the
other hand, the three adductors show more or less similar temporal patterns of
activity, in which participation of INT in tight glottal closure in hard attack
appears more dominant than in the other subjects.

[...] difference between the activity patterns of LCA and VOC in either of our
two studies. The present data suggest there is no qualitative but perhaps some
quantitative difference in their activity patterns. [...] Hirano et al. (1970)
have suggested that the two muscles [...] differences between their results and
ours may be due to the different tasks of the two groups of subjects. In any
event, further study on various vocal maneuvers is needed to determine any
possible functional differentiation within the adductor muscle group of the
larynx.

[...] considered as a prime pitch raiser acting by tensing the [...] (Simada
and Hirose, 1971). Gårding et al. (1970) reported that there is apparent
antagonism between VOC and CT in the production of a glottal stop in one of
[...] in which CT activity appeared to be suppressed at the moment of maximum
activity of VOC for the period of glottal closure. However, their data might
not be comparable to the present data because the test utterance used in that
particular experiment included variations of word accent [...] glottal stop
productions. The apparent suppression of CT activity in their data can be
correlated to the falling in pitch toward the period of glottal closure. The
present study revealed that CT shows more or less similar patterns of activity
with VOC in subjects LJR and LL and with LCA in subject AP in respect to the
difference in the type of vocal attack when pitch is not changed.

In the present study, the measurement of so-called Δt was not attempted
because of the ambiguity in defining the onset of EMG activity relative to the
onset of the acoustic representation of voicing. For example, it is not unusual
to observe in raw EMG records of the intrinsic laryngeal muscles a good amount
of continuous resting discharge even during the period of silence between
utterances. Thus, it appears difficult to define the onset of EMG activity
from the raw EMG traces and, as a result, it is difficult to specify general
rules for these measurements.
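
The difficulty can be made concrete with a hypothetical threshold rule. The sketch below is an assumption for illustration only (no such rule was used in the present study, and the name `onset_index` and its parameters are invented here): onset is taken as the first sample that exceeds the resting mean by `k` standard deviations. Because the resting discharge is never zero, the estimated onset, and hence any interval measured from it to voice onset, shifts with the arbitrary choice of `k`.

```python
from statistics import fmean, pstdev


def onset_index(envelope, baseline_n, k):
    """Return the first sample index whose value exceeds mean + k * SD of
    the resting baseline, or None if the threshold is never crossed.

    envelope: integrated EMG samples (hypothetical data)
    baseline_n: number of leading samples treated as resting discharge
    k: threshold in baseline standard deviations (an arbitrary choice)
    """
    baseline = envelope[:baseline_n]
    threshold = fmean(baseline) + k * pstdev(baseline)
    for i in range(baseline_n, len(envelope)):
        if envelope[i] > threshold:
            return i
    return None


# Made-up envelope: fluctuating resting discharge, then a gradual rise.
example = [2, 3, 2, 3, 2, 3, 4, 5, 7, 10, 14]
lenient = onset_index(example, baseline_n=6, k=2)
strict = onset_index(example, baseline_n=6, k=6)
```

In this made-up trace the lenient and strict criteria place the onset two samples apart, which is the kind of ambiguity the averaged curves avoid by comparing whole temporal patterns instead of a single onset time.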

On the other hand, the temporal patterns of averaged muscle activity
presented in this paper certainly give no less information than simple
comparisons of Δt and can be considered as more appropriate for comparisons of
such activity patterns.

As shown in the previous figures, it is generally observed that laryngeal
muscle activity, except for that of PCA, starts to increase earlier for hard
attack than for the other two. This confirms findings reported in previous
reports. However, what seems to differentiate the various types of
attacks is not simply the pre-phonatory activity of the adductors but rather
the activity patterns of all the intrinsic laryngeal muscles. In other
words, different coordinated actions of the intrinsic laryngeal muscle systems,
working in reciprocal fashion, determine each type of vocal attack.
A Parallel Between Encodedness and the Magnitude of the Right Ear Effect

James E. Cutting*

Haskins Laboratories, New Haven



Early studies in dichotic listening (Broadbent, 1956; Kimura, 1961)
presented different digits simultaneously to each ear. The results showed
that this task overloaded the perceptual system, and numerous errors occurred.
The errors, however, were differentially distributed; more errors occurred in
recalling digits presented to the left ear than to the right. The superior
performance of the right ear over the left is known as the right ear effect
and has been explained, in part, as a reflection of linguistic capabilities
of the cerebral hemispheres. In the dichotic situation it appears that
linguistic information can best travel the path from the right ear to the
left hemisphere (see Studdert-Kennedy and Shankweiler, 1970). We have known
since the mid-nineteenth century that the left hemisphere of the brain is
primarily responsible for language functions. Nevertheless, it was not known
what aspects of dichotic stimuli were responsible for the right ear effect.
Paired digits differ in duration, phonemic encodedness, syllabic form, and
many other aspects. Any one of these differences might have been responsible
for the ear effect.

Shankweiler and Studdert-Kennedy (1967) showed that the right ear effect
was closely related to certain parts of the sound pattern of speech, but not
[...]. The identification of stop consonants in dichotic consonant-vowel
(CV) syllables showed a large right ear effect. The identification of steady-
state vowels, on the other hand, showed no significant ear effect.

Other classes of phonemes have been tested dichotically and appear to
show results which are intermediate between stop consonants and vowels.
Liquids and semivowels (Haggard, 1971) have been shown to give a right ear
effect, but the magnitude appears to be smaller than that usually found for
stops. Fricatives (Darwin, 1971) have been shown to give a small right ear
effect when they have formant transitions, but no ear effect when the
transitions are removed.

A possible synthesis of the results of these studies is to propose an
ear-effect continuum which parallels an encodedness continuum (see Day, in
press). Liberman et al. (1967) have used the term "encodedness" to describe
the amount of acoustic restructuring a phoneme undergoes in various speech
contexts. Highly encoded phonemes (e.g., stops) undergo considerable change
in their acoustic form as a function of their environments, whereas less
encoded phonemes (e.g., fricatives, vowels) undergo little change. Thus the
phonemes might be arrayed in parallel along an encodedness continuum and an
ear-effect continuum in the following manner: stop consonants are the most
highly encoded phonemes and generally give the largest right ear effects in
dichotic listening; liquids and semivowels are less encoded than stops and
generally show smaller right ear effects; fricatives are less encoded than
liquids and generally show a small right ear effect; and vowels are the least
encoded of the phoneme classes and usually show no ear effect.

Thus far the existence of an ear-effect continuum and any parallel it
might have with an encodedness continuum have been only speculative. No
study has tested the various phoneme classes with the same group of subjects
and made the appropriate comparisons. The present study attempts to make
these comparisons using stops, liquids, and vowels combined within CCV
nonsense syllables.

GENERAL METHOD 

Stimuli. Eight consonant-consonant-vowel (CCV) syllables were prepared
on the Haskins parallel resonance synthesizer. There were three phoneme
classes within each syllable: stops, liquids, and vowels. Each phoneme
class was represented by two phonemes: /g/ and /k/ were the stops; /l/ and
/r/, the liquids; and /ɛ/ and /æ/, the vowels. All possible combinations
were used: /glɛ, klɛ, grɛ, krɛ, glæ, klæ, græ, kræ/. The stimuli were
455 msec in duration and had the same falling pitch contour. The duration
of the formant transitions in the stop + liquid clusters was 210 msec,
followed by 245 msec of the steady-state vowel.

Subjects. Sixteen Yale undergraduates served as subjects in both
experiments. They were all right-handed native American English speakers with no
history of hearing trouble. Subjects were tested in groups of four, with 
stimuli played on an Ampex AG500 tape recorder and sent through a listening 
station to Grason-Stadler earphones. 

EXPERIMENT I: IDENTIFICATION 

A brief identification test was run to assess the quality of the stimuli. 

Procedure. The subjects listened to two tokens of each stimulus to 
familiarize them with the synthetic speech. They then listened to a binaural 
identification tape of sixty-four items. Each of the eight stimuli was pre- 
sented eight times in random sequence with a three-second interstimulus interval. 
Subjects were asked to identify each stimulus, writing their responses using
the following orthography: GLEH, KLEH, GREH, KREH, GLAA, KLAA, GRAA, KRAA.

Results. The stimuli were highly identifiable. Subjects correctly
identified the stimuli on more than 97% of the trials.

EXPERIMENT II: EAR MONITORING 

Tapes and Procedure. The same eight stimuli were used; however, this
time, instead of presenting one stimulus at a time, two stimuli were
presented simultaneously, one to each ear. Dichotic tapes were prepared using
the pulse code modulation system (Cooper and Mattingly, 1969). Each stimulus
was paired with all other stimuli, but not with itself. There were 112
dichotic items per tape: (28 possible pairs) × (2 channel arrangements per
pair) × (2 replications). Two such tapes were prepared with different random
orders. Both tapes had a four-second interval between pairs. Subjects
listened to two passes through each 112-item tape for a total of 448 trials.
They were told to listen to one ear at a time and to write down which of the
eight stimuli they heard presented to that ear. The order of ear monitoring
was arranged in the following manner: half the subjects attended first to the
left ear for a quarter of the trials, then to the right ear for half the
trials, and finally back to the left ear for the last quarter (LRRL). The
other half of the subjects attended in the reverse order (RLLR). There was
a brief rest between blocks of 112 trials. The order of the ear monitoring,
the order of the channel-to-ear assignments, and the order of the tapes were
counterbalanced across subjects.
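
The item counts in the design above can be checked with a short sketch. It enumerates the pairs from the orthographic labels of Experiment I; the function name and the ordering of items are illustrative assumptions, not a reconstruction of how the tapes were actually assembled.

```python
from itertools import combinations

# Orthographic labels for the eight CCV stimuli, as listed in Experiment I.
STIMULI = ["GLEH", "KLEH", "GREH", "KREH", "GLAA", "KLAA", "GRAA", "KRAA"]


def dichotic_items(stimuli, replications=2):
    """Return (channel 1, channel 2) items for one tape.

    Every unordered pair of different stimuli appears in both channel
    arrangements, and the whole set is repeated `replications` times.
    """
    items = []
    for a, b in combinations(stimuli, 2):  # a stimulus is never paired with itself
        for _ in range(replications):
            items.append((a, b))  # one channel arrangement
            items.append((b, a))  # the reversed arrangement
    return items


n_pairs = len(list(combinations(STIMULI, 2)))
items_per_tape = len(dichotic_items(STIMULI))
total_trials = items_per_tape * 2 * 2  # two tapes, two passes through each

print(n_pairs, items_per_tape, total_trials)
```

The three printed counts correspond to the number of possible pairs, the items per tape, and the total number of trials per subject.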

Results and Discussion 

There are two major levels at which to analyze the data, the syllable 
level and the phoneme level. 

Syllable level. Although the subjects were familiar with the eight
stimuli, many errors occurred in reporting the correct syllable in the
monitored ear. A syllable was scored correct when all three phonemes were
correctly reported. Overall performance was 58% correct. Subjects performed
significantly better when they monitored the right ear than when they
monitored the left [F(1,15) = 20.96, p < .001]; they were 62% correct in
reporting the syllable when they attended the right ear and only 53% correct
when they attended the left, a net 9% ear difference.

Phoneme level. Since there was a stop, a liquid, and a vowel in each
stimulus, the syllable can be parsed to look at the overall performance and
ear effects for each phoneme class.

If we consider each phoneme as a stimulus, there are two types of trials,
contrast trials and identity trials. Considering the stops, there are those
trials in which the two stimuli share the same stop, for example GREH/GLAA
and KRAA/KLAA, and those which have different stops, for example GREH/KLAA
and KRAA/GLAA. The first type of trial is a stop-identity trial, the second
type is a stop-contrast trial. (Note that when considering a given phoneme
class, we temporarily disregard the other phoneme classes.) There are also
two types of liquid trials. GLEH/KLAA is a liquid-identity trial and GLEH/KRAA
is a liquid-contrast trial. Vowels may be treated in the same manner; there
are vowel-identity trials (e.g., KLAA/GLAA) and vowel-contrast trials (e.g.,
KLAA/GLEH). It is on the contrast trials that most (92%) of the errors
occurred, and it is those which we will discuss first.
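
The identity/contrast distinction can be made concrete with a minimal sketch that classifies a dichotic pair separately for each phoneme class. It assumes the orthographic labels used in Experiment I (one letter for the stop, one for the liquid, two for the vowel); the function names are hypothetical.

```python
def parse(syllable):
    """Split an orthographic label such as 'GLEH' into (stop, liquid, vowel)."""
    return syllable[0], syllable[1], syllable[2:]


def trial_types(left, right):
    """Label a dichotic pair 'identity' or 'contrast' for each phoneme class.

    Each class is judged on its own, disregarding the other two classes.
    """
    classes = ("stop", "liquid", "vowel")
    return {
        name: "identity" if l == r else "contrast"
        for name, l, r in zip(classes, parse(left), parse(right))
    }


print(trial_types("GREH", "GLAA"))
print(trial_types("KLAA", "GLEH"))
```

A single pair can therefore be an identity trial for one class and a contrast trial for another, which is why the analyses below are reported class by class.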

Contrast trials. First consider the stops. Subjects were 66% correct
in reporting the stop in the monitored ear. There was a large, significant
right ear effect [F(1,15) = 22.55, p < .001]: subjects were 72% correct in
reporting the stops when monitoring the right ear and only 60% correct when
monitoring the left, a net 12% difference. Eight of the sixteen subjects
had significant right ear effects, and none had significant left ear effects,
as shown in Figure 1. These results were calculated using a chi square index
discussed below.

Figure 1: Distribution of subjects' ear effects for the three phoneme
classes, calculated by the chi square index.



The liquids showed a pattern similar to the stops. Subjects correctly 
identified the liquid in the monitored ear on 64% of the trials. Again there 
was a significant right ear effect [F(1,15) = 13.33, p < .005], but somewhat
smaller than that for the stops: subjects were 68% correct in reporting the 
liquid when monitoring the right ear and only 59% correct when monitoring the 
left, a net 9% difference. Figure 1 shows that, unlike the stops, only four 
subjects had right ear effects, but again, none had a significant left ear 
effect. 

Vowels showed a very different pattern of results. Overall performance 
was considerably higher: subjects were 81% correct in identifying the vowel 
in the monitored ear. Furthermore, there was no ear effect for the group data. 
But the group average is misleading. Seven subjects did have significant ear 
effects: three had a right ear effect, and four had a left ear effect, as 
shown in Figure 1. 

Chi square index and phoneme class comparisons. To pursue the idea of
an ear-effect continuum for stops, liquids, and vowels, we must be able to
compare ear effects for the three phoneme classes. To do this we must
compensate for the different performance levels. A chi square analysis takes
this consideration into account.¹ The analysis is performed on a 2 × 2
contingency table. The cell entries are the number of trials for a) right ear
correct, b) left ear correct, c) right ear incorrect, and d) left ear
incorrect. A chi square is computationally always positive. However, if we
arbitrarily assign positive values to right ear effects and negative values
to left ear effects, we have an index which distinguishes between the two
results. A two-way chi square index was computed for each subject for each
phoneme class with p < .025 as the criterion for rejecting the null hypothesis.
Since the chi square index is a monotonic transformation of the original data,
the chi square indices are suitable for further analysis.
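
A minimal sketch of such a signed index follows. It uses the ordinary Pearson chi square for a 2 × 2 table without a continuity correction; the text does not say whether a correction was applied, so that choice, the function name, and the example counts are all assumptions made for illustration.

```python
def signed_chi_square(right_correct, left_correct, right_incorrect, left_incorrect):
    """Pearson chi square for the 2 x 2 table, signed by direction of ear effect.

    Positive values mark a right ear advantage, negative values a left ear
    advantage. No continuity correction is applied (an assumption).
    """
    a, b, c, d = right_correct, left_correct, right_incorrect, left_incorrect
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0  # a degenerate table carries no evidence of an ear effect
    chi2 = n * (a * d - b * c) ** 2 / denominator
    # a*d > b*c exactly when the right ear has the higher proportion correct.
    return chi2 if a * d >= b * c else -chi2


# Hypothetical counts for one subject and one phoneme class, 100 trials per ear.
print(signed_chi_square(72, 60, 28, 40))
print(signed_chi_square(60, 72, 40, 28))
```

Swapping the two ears changes only the sign of the index, which is the property that lets right and left ear effects be placed on one scale.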

Figure 2 shows the ear effects and ranges for the stops, liquids, and
vowels arrayed in the order of their encodedness from high to low. The data
are plotted in terms of percent right ear correct minus percent left ear
correct. Thus left ear advantages yield negative scores. Note that the array
appears to show an ear-effect continuum: the right ear effect for the stops is
greater than that for liquids, which in turn is greater than that for vowels.
This linear trend from a large right ear effect for stops to no ear effect
for vowels is also reflected in the ranges of the phoneme classes. A trend
test (Winer, 1962) showed that this linear relationship was significant
[F(1,45) = 9.56, p < .005] by analysis of variance on the subjects' chi
square indices.² Furthermore, nine subjects showed this relationship: stops
greater than liquids, greater than vowels. By chance alone this is a very
unlikely outcome (z = 3.91, p < .0005). Only one subject had ear effects in
the reverse order.
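
The text does not say how the z value for the nine subjects was obtained. One plausible reading, sketched here as an assumption, is a normal approximation to the binomial with a continuity correction, taking the chance probability of any one ordering of three classes to be 1/6.

```python
from math import sqrt


def z_binomial(successes, n, p, continuity=0.5):
    """Normal approximation to the binomial, with a continuity correction."""
    mean = n * p
    sd = sqrt(n * p * (1 - p))
    return (successes - continuity - mean) / sd


# 9 of 16 subjects showed stops > liquids > vowels; three classes can be
# ordered in 6 ways, so chance probability is assumed to be 1/6.
z = z_binomial(9, 16, 1 / 6)
print(round(z, 2))
```

Under these assumptions the result agrees closely with the reported value, which suggests, but does not establish, that this was the test used.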



¹ I would like to thank Gary Kuhn for many suggestions which led to the use
of this statistic.

² In this type of analysis it is also necessary to consider other possible
trends. Since there are only three phoneme classes, the only possible
trends are linear and quadratic. The quadratic trend did not approach
significance [F(1,45) = .77, ns].

Figure 2: Means and ranges of subjects' ear effects for the three phoneme
classes arrayed in the order of their encodedness.

A possible difficulty with the present study is the correspondence
between phoneme class and temporal order. Stops are always first, liquids
second, and vowels third. Nevertheless, the average ear effects shown in
Figure 2 compare favorably with the results from other studies. Shankweiler
and Studdert-Kennedy (1967) found a 14% right ear advantage for stops in
CV syllables. They also found a net 9% right ear advantage for both initial
and final stops in CVC syllables (Studdert-Kennedy and Shankweiler, 1970).
Haggard (1971) found a 5% right ear advantage for initial liquids and
semivowels in CVC syllables. For steady-state vowels Shankweiler and
Studdert-Kennedy (1967) found a nonsignificant 4% right ear advantage and
Chaney and Webster (1966) found a 1% right ear advantage.

Identity trials. Most of the errors (92%) occurred on contrast trials,
where phonemes of a given class were not shared. The remaining 8% occurred
on identity trials. Few errors occurred on these trials because the same
phoneme was presented to both ears. If /k/ was part of the stimulus in both
ears, the subjects had little trouble in identifying the /k/ as part of the
stimulus in the monitored ear. That errors occurred at all was probably a
result of acoustic differences between the two instances of the same phoneme.
For example, a /k/ before /l/ is slightly different from a /k/ before /r/.
Although identity-trial errors were relatively few, significantly more errors
were made when subjects monitored the left ear than when they monitored the
right (z = 3.52, p < .0005). There were no differences among the stop, liquid,
and vowel classes for these errors. All had significant right ear effects.

Individual phonemes. Using a chi square analysis, we can also assess
ear effects for individual phonemes within each phoneme class. There was no
difference between the two phonemes in either the stop or the vowel classes.
Both /g/ and /k/ had similar right ear effects. Both /ɛ/ and /æ/ had no
ear effects.

There was, however, a difference between the liquids. Subjects had a
12% right ear advantage for /l/ and a 5% right ear advantage for /r/. This
difference was significant (p < .05) by a Wilcoxon test on the chi square
indices (Siegel, 1956). The occurrence of this differential ear effect for
/l/ vs. /r/ is puzzling. The liquids often present puzzling problems in
speech perception and speech production; for a description of other
interesting phenomena see Cutting and Day (in press).

Summary. Sixteen subjects were tested in a dichotic ear-monitoring task
using stop-liquid-vowel nonsense syllables. The results showed that

1) There was an overall right ear effect for reporting the monitored
syllables.

2) The ear effects for stops, liquids, and vowels were arrayed along
a continuum. There was a larger right ear effect for stops than liquids, and
a larger right ear effect for liquids than vowels. This relationship parallels
an encodedness continuum for the same phoneme classes. Stops undergo more
context-conditioned variation than liquids, and liquids undergo more variation
than vowels. Thus, the present study lends evidence for a parallel between
the two continua.


Mutual Interference Between Two Linguistic Dimensions of the Same Stimuli*
Ruth S. Day+ and Charles C. Wood++



A single speech stimulus can be considered as a composite of values along 
many different dimensions. For example, a token of the syllable /ba/ will 
have a particular fundamental frequency, overall intensity, initial second- 
formant transition, formant values for the vowel, and so on. We are inter- 
ested in the extent to which a given dimension can be processed independently 
of the others. An interesting and efficient way to study this problem is to 
select two dimensions and pit them against each other in a choice reaction- 
time paradigm. Subjects must attend to one dimension and ignore the other. 

The dimensions studied in the present experiment were stop consonants
(differing in place of articulation) and vowels. On each trial a single syl- 
lable was presented binaurally. In one task subjects had to target for the 
stop consonants, while in the other they had to target for the vowels. 

Stop Consonant Task. Subjects pressed button #1 as soon as they heard /b/
and button #2 as soon as they heard /d/. This task was performed under two
conditions of stimulus variation as shown in Figure 1. In the One-Dimension
Condition, the target dimension (place of articulation for stop consonants) was
the only one that varied: the stimuli were /ba/ and /da/.¹ A mean reaction
time of 400 msec was obtained. In the Two-Dimension Condition, both stop
consonants and vowels varied: the stimuli were /ba, bæ, da, dæ/. Again,
subjects had to identify stop consonants, but they also had to ignore
irrelevant variation in vowels. They had difficulty in doing so, as reflected
by increased reaction time: the mean rose to 450 msec.

Vowel Task. Subjects pressed button #1 as soon as they heard /a/ and
button #2 as soon as they heard /æ/. The same two conditions of stimulus
variation were used (Figure 1). In the One-Dimension Condition, the target
dimension (formant values for vowels) was the only one that varied: the
stimuli were /ba/ and /bæ/.² The mean reaction time here was 348 msec. In the
Two-Dimension Condition, the same four stimuli as in the Stop Consonant Task
were used. This time, subjects had to identify vowels and ignore irrelevant
variation in stop consonants. [illegible] difficult task: mean reaction time
rose to [value illegible].

¹ Or, in another block of trials, /bæ/ and /dæ/.

² Or, in another block of trials, /da/ and /dæ/.



The results of the experiment are summarized in Figure 2. We are
interested in the extent to which reaction time for each target dimension
increased from the One-Dimension to the Two-Dimension Condition. Both tasks
yielded a sizeable increase. Thus there was a mutual interference effect:
irrelevant variation in either dimension interfered with perception of the
other.

How might these results be explained? One possibility is that the
perceptual processors for place of articulation and for vowels are strongly
interdependent. Such perceptual interdependence may well reflect known
interdependence at articulatory and acoustic levels.

If this analysis is correct, then dimensions whose processors are not so
strongly interdependent ought to yield a different pattern of results in this
paradigm. Recently, we reported such a study (Day and Wood, 1972) in which
the same stop consonants served as the linguistic dimension while fundamental
frequency served as the nonlinguistic dimension. (Fundamental frequency is
nonlinguistic in that it does not carry linguistic information at the
phoneme level in English.) The results are shown in Figure 3. Again, the Stop
Consonant Task showed a large increase in reaction time in the Two-Dimension
Condition. However, the corresponding increase for the Fundamental Frequency
Task was much less: it barely reached statistical significance.³ Thus there
was a unidirectional interference effect, in that it was much more difficult to
ignore irrelevant variation in fundamental frequency while identifying stop
consonants than vice versa.

The pattern of results for the stop consonant vs. fundamental frequency
experiment suggests that these two dimensions behave in very different ways in
the two-choice identification paradigm. When processing stop consonants, the
listener cannot disengage his processing operations for fundamental frequency.
However, when processing fundamental frequency, he can, to a considerable
extent, disengage his linguistic processing operations. In fact, some subjects
report being unaware of what consonants are occurring during the Fundamental
Frequency Task; no one reports being unaware of pitch differences in the Stop
Consonant Task.

[Illegible] consider cases where both dimensions are nonlinguistic. Wood
[citation illegible] used fundamental frequency and overall intensity and
obtained a mutual interference effect: both dimensions [illegible]. These
results are comparable to those of the present experiment. Note that in the
present experiment there were two linguistic dimensions, while in that of Wood
there were two nonlinguistic dimensions. A mutual interference effect may be a
direct consequence of [illegible] perceptual processors for the two dimensions.
Thus far, this effect has occurred only for cases where both dimensions were
from the same general class, that is, both linguistic or both nonlinguistic.
The only cases where a unidirectional effect has occurred employed a dimension
from each of the two general classes.

³ In a recent replication of this experiment, Wood (1972) obtained no increase
for the Fundamental Frequency Task.

The status of each dimension as linguistic or nonlinguistic, then, appears
to be important in predicting outcomes of these two-choice reaction-time
experiments. There are, however, other factors that may be involved. In the
experiments reported thus far, information about both dimensions is available from the
onset of the stimulus. Situations where this is not the case may behave very 
differently. For example, variation in voice onset time vs. fundamental 
frequency would delay the onset of fundamental frequency information relative 
to stop consonant information. By studying such a situation we will be able to 
determine the extent to which temporal processes are important in perceiving 
various dimensions of the speech signal. 

Another factor of possible interest here is the relative discriminability
of each dimension. Thus far we have used pairs of dimensions that are of
roughly comparable discriminability. It will be of interest to see whether
decreased discriminability of certain dimensions will alter the basic pattern
of results more than others.

Summary. Subjects listened to simple consonant-vowel syllables that 
varied along two dimensions: place of articulation for stop consonants and 
formant values for vowels. When they had to identify values along one dimen- 
sion, it was difficult to ignore irrelevant variation in the other dimension. 
This was true for both dimensions to the same extent. These results are com- 
patible with an explanation that emphasizes the degree of interdependence 
between processors for linguistic and nonlinguistic dimensions.


The Phi Coefficient as an Index of Ear Differences in Dichotic Listening



INTRODUCTION 

Studdert-Kennedy and Shankweiler (1970) have applied the index

$$\frac{R - L}{R + L}$$

where

R = the number of correct right-ear responses
L = the number of correct left-ear responses

to measure ear differences in dichotic listening tests. If the task of the
subject is to identify one stimulus on each dichotic presentation, as
was the case above, then this index may yield its maximum value of +1
regardless of the level of performance. But if the task is to identify both
stimuli on each dichotic presentation, then the maximum value of the index
decreases rapidly, as overall performance rises above 50%.
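
The ceiling on the index in the two-response case can be worked out directly. The sketch below rests on one assumption, that neither ear can be credited with more than T correct responses in T trials; the function name is hypothetical. With overall performance p = (R+L)/2T, the best case for the right ear is R = T, which gives a maximum of (1 - p)/p once p exceeds one half.

```python
def max_laterality_index(performance):
    """Largest attainable (R-L)/(R+L) when both stimuli are reported per trial.

    `performance` is the overall proportion correct, (R+L)/(2T). Assumes
    neither ear can exceed T correct responses in T trials.
    """
    if not 0 < performance <= 1:
        raise ValueError("performance must lie in (0, 1]")
    if performance <= 0.5:
        return 1.0  # every correct response can still come from one ear
    return (1 - performance) / performance


for p in (0.5, 0.6, 0.75, 0.9):
    print(p, round(max_laterality_index(p), 3))
```

The printed values fall from 1 toward 0 as performance approaches 100%, which is the rapid decrease described above.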

[Illegible] two-response paradigm, [illegible] on which one stimulus is
correctly reported should be included in the computation of the ear advantage
[illegible]. There are, however, a few points about the effect of applying the
index in this way that may be worthwhile to keep in mind.

First, the number of one-correct trials may vary considerably across
subjects. Given such differences in sample size, we cannot necessarily
have equal confidence, in the statistical sense, in ear advantages of equal
reported magnitude.

Second, the proportion of one-correct trials may vary considerably across
subjects. The ear advantage reported over one-correct trials could look
identical for subjects whose advantages over all trials (measured for
statistical significance) were of very different magnitude.

Third, it may be that the proportion of one-correct trials varies
systematically across levels of performance. But a measure of ear difference
that does not take performance into account is one that assumes that overall
performance [illegible] ear advantage. Clearly this would be an interesting
[remainder illegible].

APPLICATION TO TWO RESPONSES PER TRIAL 

It would therefore be desirable to apply over all trials in a two-response
paradigm an index of ear difference whose computed values can be compared for
statistical significance, independent of the level of performance. Such an
index may be derived from the χ².

If we let

R = the number of correct right-ear responses

L = the number of correct left-ear responses

T = the number of dichotic presentations or "trials"

we may establish as the null hypothesis that the ears contribute equally to any
outcome. In a two-response paradigm, we can express the expected
outcome of a subject's performance in the following contingency table:

                           Ear of Presentation

                           Left            Right

  Response    Correct      (R+L)/2         (R+L)/2         R+L
  Category    Incorrect    T - (R+L)/2     T - (R+L)/2     2T - (R+L)

                           T               T               2T


However, a subject's observed performance will be distributed as in the
following table:

                           Ear of Presentation

                           Left            Right

  Response    Correct      L               R               R+L
  Category    Incorrect    T - L           T - R           2T - (R+L)

                           T               T               2T              (1)



To evaluate the discrepancy between the observed and expected
frequencies of these two tables we sum the values of the following table:

$$
\begin{array}{l|cc}
 & \text{Left} & \text{Right} \\
\hline
\text{Correct} &
\dfrac{\left(L - \frac{R+L}{2}\right)^2}{\frac{R+L}{2}} &
\dfrac{\left(R - \frac{R+L}{2}\right)^2}{\frac{R+L}{2}} \\[2ex]
\text{Incorrect} &
\dfrac{\left((T-L) - \left(T - \frac{R+L}{2}\right)\right)^2}{T - \frac{R+L}{2}} &
\dfrac{\left((T-R) - \left(T - \frac{R+L}{2}\right)\right)^2}{T - \frac{R+L}{2}}
\end{array}
$$

where each cell is of the form

$$\frac{(O - E)^2}{E}$$

and

O = the observed number of responses for that cell
E = the expected number of responses for that cell

The sum of the evaluated expressions of these four cells is a χ² with one degree
of freedom, which can be used as a measure of the observed ear advantage and as
a test of the H₀.

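Note that the sum of the correct row equals

$$\frac{2\left(R - \frac{R+L}{2}\right)^2}{\frac{R+L}{2}} = \frac{2\left(\frac{R-L}{2}\right)^2}{\frac{R+L}{2}} = \frac{(R-L)^2}{R+L}$$

and that the sum of the incorrect row equals

$$\frac{2\left((T-L) - \left(T - \frac{R+L}{2}\right)\right)^2}{T - \frac{R+L}{2}} = \frac{2\left(\frac{R-L}{2}\right)^2}{T - \frac{R+L}{2}} = \frac{(R-L)^2}{2T - (R+L)}$$

A simplified form for stating the four-cell sum is then

$$\chi^2 = \frac{(R-L)^2}{R+L} + \frac{(R-L)^2}{2T - (R+L)}$$

This sum is a linear transform of the index (R-L)/(R+L) with the y intercept
changing as a function of performance.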
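
As a check on the algebra, the following sketch computes the four-cell sum twice, once cell by cell from observed and expected frequencies and once from the simplified form, and confirms that the two agree. The function names are illustrative; the counts are those of the worked example given later in this paper (R = 88, L = 72, T = 100).

```python
def four_cell_sum(R, L, T):
    """Sum of (O - E)^2 / E over the four cells of the contingency table."""
    expected_correct = (R + L) / 2
    expected_incorrect = T - expected_correct
    observed = [L, R, T - L, T - R]
    expected = [expected_correct, expected_correct,
                expected_incorrect, expected_incorrect]
    # Undefined when R + L is 0 or 2T, because an expected frequency is zero.
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))


def simplified_sum(R, L, T):
    """The same quantity in its simplified two-term form."""
    return (R - L) ** 2 / (R + L) + (R - L) ** 2 / (2 * T - (R + L))


print(four_cell_sum(88, 72, 100), simplified_sum(88, 72, 100))
```

Both routes give the same value, so the simplified form can be used in practice without building the tables.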

If the observed number correct actually is the same for the two ears, then
the four-cell sum will take on a value of 0. At the other extreme, if one ear
reports every stimulus correctly and the other ear none, then the generated
value will reach its maximum of

$$\frac{T^2}{T} + \frac{T^2}{T} = 2T$$

We may normalize the scale of possible values of the four-cell sum. A
normalizing procedure which reduces the maximum value of the sum to 1.0 and its
distribution under the H₀ to that of a normal variate consists of dividing
the computed χ² by N (N = 2T here), and then taking the square root of the
resulting value.

$$\chi_{\text{norm}} = \sqrt{\frac{\chi^2_{\text{comp}}}{N}}$$

It turns out that the values of this normalized four-cell sum are equal in
magnitude to those of the "phi coefficient," since

$$\chi^2 = N\,\mathrm{phi}^2$$

(Walker and Lev, 1953:272). The phi coefficient is a correlation coefficient
for two independent, dichotomously measured dimensions. If dimension 1 is either
R or L and dimension 2 is either C or I, then the 2 × 2 contingency table from
which the strength of their phi could be evaluated would look like the following:

                L         R

      C         b         a         a+b
      I         d         c         c+d

                b+d       a+c       a+b+c+d

where a, b, c, and d are cell frequencies. For the special instance where the
column totals are equal,

$$a + c = b + d$$

the computational formula for the phi coefficient reduces to

$$\mathrm{phi} = \frac{a - b}{\sqrt{(a+b)(c+d)}}$$

(Walker and Lev, 1953:272).

Given this relationship between phi and the χ², and the fact that the column
sums in the observed-frequencies contingency table (1) are indeed identical, we
may compute the value of the normalized χ² index from

$$\mathrm{phi} = \frac{R - L}{\sqrt{(R+L)\,(2T - (R+L))}}$$

One reason for favoring computation of the value of this index through the phi
formula is that the direction of the ear advantage, i.e., its sign, is retained.
Computed in this way, the index can be thought of as yielding a value of
correlation between correct performance and "right earedness": a negative value
indicates a left-ear advantage.

By way of example, suppose that in a dichotic listening test of 100 trials,
where the task was to report both stimuli on each trial, a subject gave the
following performance:

L = 72
R = 88

His ear advantage would be

$$\mathrm{phi} = \frac{R - L}{\sqrt{(R+L)\,(2T - (R+L))}} = \frac{88 - 72}{\sqrt{(160)(40)}} = .20$$
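
The worked example can be reproduced in a few lines. This is a minimal sketch of the computational formula; the function name and argument order are illustrative choices.

```python
from math import sqrt


def phi_index(R, L, T):
    """Signed phi index of ear difference for a two-response paradigm.

    R and L are the numbers of correct right-ear and left-ear responses over
    T dichotic trials. Positive values indicate a right-ear advantage.
    Undefined when R + L is 0 or 2T.
    """
    return (R - L) / sqrt((R + L) * (2 * T - (R + L)))


print(phi_index(88, 72, 100))  # the hypothetical subject in the text
print(phi_index(72, 88, 100))  # the same scores with the ears exchanged
```

Exchanging the two ears reverses the sign and leaves the magnitude unchanged, which is the property that recommends the phi formula over the unsigned χ².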

Critical Values 



A table of the smallest significant or "critical" values of the index may
be constructed by letting

χ² = the value of χ² with 1 df at the desired level of significance

N = 2T, i.e., the total number of responses

and solving the equation

$$\mathrm{PHI} = \sqrt{\frac{\chi^2}{N}}$$

The following table has been calculated in this fashion and is included
here for illustrative purposes. From the table we see that the probability of
obtaining an ear advantage as large as the one observed for the hypothetical
subject above is < .01.



            Probability under the H₀ that phi ≥ PHI

      2T          .05         .02         .01

      48          .282        .335        .371
      80          .219        .260        .288
      96          .200        .237        .262
     100          .195        .232        .257
     120          .178        .212        .235
     160          .154        .183        .203
     192          .141        .167        .185
     200          .138        .164        .182
     240          .126        .150        .166
     320          .109        .130        .144
     384          .100        .118        .131
     400          .097        .116        .128
     480          .089        .106        .117
     640          .077        .091        .101
     960          .063        .075        .083
    1000          .061        .073        .081
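
Critical values of this kind can be regenerated from the relation between phi and χ². The sketch below assumes the standard tabled χ² values for one degree of freedom (3.841, 5.412 and 6.635 for the .05, .02 and .01 levels); these constants come from ordinary statistical tables, not from this paper, and the function name is hypothetical.

```python
from math import sqrt

# Standard chi-square critical values for 1 degree of freedom (assumed here;
# the paper does not list the values it used).
CHI2_CRITICAL_DF1 = {0.05: 3.841, 0.02: 5.412, 0.01: 6.635}


def critical_phi(n_responses, alpha):
    """Smallest phi reaching significance at `alpha` for N = 2T responses."""
    return sqrt(CHI2_CRITICAL_DF1[alpha] / n_responses)


for n in (48, 100, 200, 1000):
    row = [round(critical_phi(n, a), 3) for a in (0.05, 0.02, 0.01)]
    print(n, row)
```

Values computed this way may differ from the printed table in the third decimal place, since the printed entries do not follow one consistent rounding rule.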



A Comparison of Indices 

For the sake of comparison, the ear advantages of five hypothetical
subjects have been computed using three indices:

1. (R-L)/(R+L) over all trials

2. (R-L)/(R+L) over one-correct trials

3. phi over all trials

If the data for the 100 trials of a two-response paradigm are

                 Over all trials        Over one-correct trials
    Subject        L         R               L         R

       1           87        93              2         8
       2           77        83              2         8
       3           67        73              2         8
       4           57        63              2         8
       5           47        53              2         8



then the values of the ear advantages would be

                (R-L)/(R+L)     (R-L)/(R+L)
    Subject     all trials      one-correct     phi      P(phi ≥ PHI)

       1           .033            .600         .100     not significant
       2           .037            .600         .075     not significant
       3           .042            .600         .065     not significant
       4           .050            .600         .061     not significant
       5           .060            .600         .060     not significant
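
The comparison can be recomputed from the five subjects' scores. This is a sketch using the formulas given earlier in this paper; the dictionary layout and function names are illustrative.

```python
from math import sqrt

T = 100  # dichotic trials per subject

# subject: (L over all trials, R over all trials, L one-correct, R one-correct)
SUBJECTS = {
    1: (87, 93, 2, 8),
    2: (77, 83, 2, 8),
    3: (67, 73, 2, 8),
    4: (57, 63, 2, 8),
    5: (47, 53, 2, 8),
}


def laterality(R, L):
    """The (R-L)/(R+L) index."""
    return (R - L) / (R + L)


def phi_index(R, L, T):
    """Signed phi index over all trials of a two-response paradigm."""
    return (R - L) / sqrt((R + L) * (2 * T - (R + L)))


for subject, (l_all, r_all, l_one, r_one) in SUBJECTS.items():
    print(subject,
          round(laterality(r_all, l_all), 3),
          round(laterality(r_one, l_one), 3),
          round(phi_index(r_all, l_all, T), 3))
```

The one-correct index is the same for every subject, while phi falls as overall performance drops toward 50%, where it coincides with (R-L)/(R+L) over all trials.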





50% Performance

If a subject's performance level over a given set of trials is 50%, then
his errors equal his number correct

$$2T - (R+L) = R+L$$

and his total number of responses equals twice his number correct

$$N = 2T = 2(R+L)$$

His four-cell sum is then

$$\chi^2 = \frac{(R-L)^2}{R+L} + \frac{(R-L)^2}{R+L} = \frac{2(R-L)^2}{R+L}$$

And since

$$\mathrm{phi} = \sqrt{\frac{\chi^2}{N}}$$

we have

$$\mathrm{phi} = \sqrt{\frac{2(R-L)^2}{R+L} \cdot \frac{1}{2(R+L)}} = \sqrt{\frac{2(R-L)^2}{2(R+L)^2}} = \frac{R-L}{R+L}$$

Since the values of the two indices are identical for the case of 50%
performance, it might appear that values of (R-L)/(R+L) as computed over
one-correct trials could be compared directly for statistical significance.
This is not so, of course, if the size of the one-correct sample varies from
subject to subject.



ONE RESPONSE PER TRIAL 
Using the same computational formula,

$$\mathrm{phi} = \frac{R - L}{\sqrt{(R+L)\,(2T - (R+L))}}$$

the phi index could also be applied to the data of a one-response directed
recall listening test.

In this application, only a correct response from the requested ear is 
counted as correct under either condition of recall. Also, the quantity T is 
set equal to the number of trials under either condition of recall. 

The null hypothesis here is that the ears contribute equally to the 
requested R+L. Phi is computed once over both conditions. 

It does not seem to be appropriate to apply the phi to the data of a one- 
response, free-recall paradigm, since the performance of either ear may conceiv- 
ably exceed half the total number of responses, or looked at another way, since 
an incorrect response cannot be assigned to either ear. 

SUMMARY AND CONCLUSION 

The phi correlation coefficient is proposed as an index of ear differences 
in dichotic listening tests. It is proposed specifically for the two-response 
paradigm, where, as an index of ear difference over all trials, it would be 
statistically appropriate for correlation with overall performance. 

Using the same computational formula, the phi index may also be applied to 
the results of a one-response, directed-recall listening test. 

The interest of the phi index lies in the fact that for a constant size of 
response set and number of dichotic trials, its values may be directly compared 
for statistical significance, independent of the level of performance.


The Relationships Between Speech and Reading
Ignatius G. Mattingly+ and James F. Kavanagh



For scientists who have a special concern with language (researchers in
linguistics, phonetics, speech science, experimental psychology, and
communications engineering), no subject in the school curriculum arouses as
much interest as reading. It is impossible to speculate very deeply about
reading without touching on the nature of thought and language, and on the
fundamental role that reading plays in this society. At first, of course,
because his own experience of learning to read is so far in the past, the
speculator takes his literacy for granted, just as he does his ability to
speak and to listen to language. It is regrettable that some have speculated
no further and rashly issued ex cathedra directives about the proper methods
of reading instruction. Those who do consider a little further realize that
reading is really a rather remarkable activity which could hardly have been
predicted from what is presently known about the production and perception of
speech and language.

Recent research by linguists in generative grammar and by experimental
phoneticians in speech perception has, if anything, made reading seem even more
remarkable. The form of natural language, as well as its acquisition and
function, Chomsky (1965) tells us, are biologically determined. There is good
reason to believe, according to Liberman et al. (1967), that linguistic
communication depends on some very special neural machinery, intricately linked
in all normal human beings to the vocal tract and the ear. It is therefore rather
surprising to find that a substantial number of people can also, somehow,
perform linguistic functions with their hands and their eyes. Reading seems more
remarkable still when one considers that only in modern Western culture is it a
basic social skill. Some civilizations have attained a high level of culture
without being literate at all; in many others, reading and writing were the
prerogatives of the hierarchy or the skills of the specialist. But this society
insists that everyone learn to read and, if he wishes to obtain or retain
middle-class credentials, to read in silence, rapidly and efficiently. In
Augustine's (397 A.D.) Confessions (Book VI), he records his amazement on finding
that when his teacher, Ambrose, was reading, "his eye glided over the pages, and
his heart searched out the sense, but his voice and tongue were at rest... the
preserving of his voice (which a very little speaking would weaken) might be
the... reason for his reading to himself." How surprised Augustine would be if he
could see millions of children learning to do Ambrose's little trick.



Paper presented at the International Reading Association Convention, Detroit,
Mich., May 1972.

+Haskins Laboratories, New Haven, and University of Connecticut, Storrs.
++National Institute of Child Health and Human Development, Bethesda, Md.

[...] a group, including researchers in all the disciplines [...] sponsorship
at Belmont, the Smithsonian Institution conference center in Maryland, for three
days of papers and [...] speech and reading. For the most part, [...]
specialized not in the study of reading but in [...] interesting ways: speech
production and perception, [...] language acquisition, memory. But the group
also [...] research in reading for many years.

The [...] purpose of the conference was to consider speech and reading [...]
linguistic points of view, but the cultural role of reading came in for some
heated discussion as well. In retrospect, it seems [...] throughout the
conference [...] question arose in various guises which may seem quite
dissimilar at first. Its most familiar guise is the question of reading
readiness: just what, besides [...] child [...] learn to read? [...] reading and
listening, as Bloomfield (1942) and [...] thought, be regarded simply as
parallel processes in different [modalities], converging at some point on a
common linguistic path? Or, finally, [...] abstractly: is it really possible to
represent [...] nontrivial block diagram?

To answer these questions, or at least to understand them better, it [...]
differences between speech perception [...] are interesting because they cannot
be attributed merely to differences in modality.2 To begin with, listening is
easy and reading is hard. [...] are spoken languages, and every normal child
acquires [through] maturation a tacit knowledge of the grammatical rules of his
native [...] understand it. In fact, we are forced to conclude that the child
has in some sense an innate ability to perceive speech, for without some such
ability he could not collect the linguistic data that Chomsky (1965) [...] infer
these grammatical rules. Indeed, some recent work by Eimas et al. (1971)
suggests that a four-week-old infant is capable of



1 [...] "Communicating by Language: The Relationships Between Speech and
Learning to Read." Those who attended or contributed to [...] authors, [...]
Franklin S. Cooper, Robert [...] Morris Halle, James J. Jenkins, [...] David
LaBerge, Joe L. Lewis, Alvin M. Liberman (co-chairman), Isabelle Y. Liberman,
Lyle L. Lloyd, John Lotz, Samuel E. Martin, George A. Miller, Donald A. Norman,
Wayne O'Neil, [...] Stevens. The conference proceedings [...] be [...] Language
by Ear and by Eye [...]. (The papers given by [...], by I. Y. Liberman and
Shankweiler, and by Mattingly appeared in SR-27.)

2 [...] (1968) [...] conference [...]



phonetic discrimination. On the other hand, relatively few languages in the
history of the world have been written languages, and the alphabet seems to have
been invented only once. In general, children must be deliberately taught to
read, and despite this teaching, many of them fail to learn. Someone who has
been unable to acquire language by listening (for example, a congenitally and
profoundly deaf child) will hardly be able to acquire it by reading; on the
contrary, a child with a language deficit owing to deafness will have great
difficulty learning to read properly.

Secondly, the form in which information is presented is basically
different for the listener and the reader. The listener is processing a
complex acoustic signal in which the speech cues lie buried. (A "speech cue" is
a specific acoustic event that carries linguistic information, for example,
the aspiration that distinguishes voiceless /p, t, k/ from voiced /b, d, g/.)
The cues are not discrete events, well separated in time and frequency; they
blend into one another in complex ways. The segmental sounds the listener
perceives quite often have no obvious segmental counterparts in the signal. To
recover the phonetic segments, the listener has first to separate the speech
cues from a mass of irrelevant detail. The process is largely unconscious, and
in many cases a listener is quite unable to hear a speech cue as a purely acoustic
event; he hears only phonetically (Mattingly et al., 1971). The complexity of the
listener's task is indicated by the fact that no scheme for speech recognition by
machine has yet been devised that can perform it properly. The reader, on the
other hand, is processing a series of symbols which are quite simply related to
the physical medium which conveys them. The marks in black ink are information;
the white paper is background. The reader has no difficulty in seeing the
letters as visual shapes if he chooses to, and optical character recognition by
machine, though it is a very challenging problem for the engineer, is one that can
be solved.



If reading and listening differed only in modality, one would expect that a 
visual presentation of speech that preserved the essential linguistic information 
could be easily read and, conversely, that an acoustic representation of written 
text which clearly differentiates the sounds representing the letters would be 
easy to listen to. But neither prediction is correct. It is possible to display 
speech visually in the form of a sound spectrogram, which shows the distribution 
of energy in the acoustic frequency range over time. We know that a spectrogram 
contains most of the essential linguistic information, for it can be converted 
back to acoustic form without much loss of intelligibility (Cooper, 1950). Yet 
reading a spectrogram is very slow work at best, and at worst, impossible. The
converse task, "reading" written characters represented in acoustic form, is
somewhat easier but not very fast. For example, Morse Code, or the various
acoustic alphabets for the blind reader, can be understood only at rates much
slower than a typical listening rate for speech.

Finally, the number of different sounds used in speech in all the languages
of the world is relatively small. These sounds can be classified in terms of
their component phonetic features (voiced or voiceless, stop or fricative,
labial or dental or velar), and the number of these features is very small:
fifteen or twenty at most (Stevens and Halle, 1967). But the situation with the
writing systems of the world, as one can verify by spending an hour or two looking
at the plates in David Diringer's book, The Alphabet (1968), is very different.
Formally speaking, the symbols used in writing systems have an endless variety,
and so do conventions for arrangement of symbols on the page. Swift (1727) does
not exaggerate in his description of the writing system of the Lilliputians in
Gulliver's Travels: "Their manner of writing is very peculiar, being neither
from the left to the right, like the Europeans; nor from the right to the left,
like the Arabians; nor from up to down, like the Chinese; nor from down to up,
like the Cascagians; but aslant from one corner of the paper to the other, like
ladies in England." (Book I, Chap. 6)

However, if one looks at a writing system not just as an ensemble of visible
marks but as a representation of some linguistic level, one finds a more orderly
variation. The possible levels seem to range from the morphemic to the phonetic.
Chinese characters are essentially morphemic; no information about pronunciation
is given. If one wishes to read aloud in some dialect of Chinese one must have
memorized the phonetic values of the characters in that dialect. The English
writing system, as Chomsky (1970) has remarked, is essentially morphophonemic.
Thus we use the letter s for the regular plural morpheme even though it is
phonetically realized not only as [s] in cats but also as [z] in cans and as [əz]
in cases. The orthography preserves the morphological relationship between sign
and signature even though the phonetic vowel written as i is different in the two
words and the g is pronounced in signature but silent in sign. But, as Martin
points out in his conference paper, English, unlike Chinese, does not always
define the morpheme boundaries clearly. Are misled, molester, and bedraggled to
be read mis+led, molest+er, and be+draggled or as misl+ed, mole+ster, and
bed+raggled? [...] phonetic level, for instance [...] or Spanish. Either their
morphology is less complex than [...] morphological complexity is masked by the
written language for the sake of phonetic regularity. In his conference paper,
Klima explores this range of orthographic variation from a theoretical
standpoint, proposing several conceivable orthographic conventions for
representing morphological and phonological content of sentences.
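
The morphophonemic point about the plural s can be made concrete with a small sketch. The function below picks the phonetic realization of the regular plural from the final sound of the stem. The sound classes and the one-symbol-per-sound transcriptions are simplifying assumptions introduced here for illustration; they are not part of the original discussion.

```python
# Final sounds after which the regular plural is realized as [əz] (sibilants).
SIBILANTS = {"s", "z", "ʃ", "ʒ", "tʃ", "dʒ"}
# Remaining voiceless final sounds, after which the plural is realized as [s].
VOICELESS = {"p", "t", "k", "f", "θ"}

def plural_allomorph(stem_sounds: list[str]) -> str:
    """Return the phonetic form of the regular plural for a stem.

    `stem_sounds` is a list of sound symbols in a rough broad transcription
    (an assumption of this sketch), e.g. ["k", "æ", "t"] for "cat".
    """
    final = stem_sounds[-1]
    if final in SIBILANTS:
        return "əz"
    if final in VOICELESS:
        return "s"
    return "z"  # all other (voiced) final sounds, vowels included

# The written form is the same letter s in all three cases.
for word, sounds in [("cat", ["k", "æ", "t"]),
                     ("can", ["k", "æ", "n"]),
                     ("case", ["k", "eɪ", "s"])]:
    print(word, "+ s ->", "[" + plural_allomorph(sounds) + "]")
```

The spelling stays constant while the pronunciation varies, which is the sense in which the orthography records the morpheme rather than its phonetic shape.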

Twenty years ago, it could have been said that the range of writing systems
spread over most of the known linguistic domain and that in principle there was
no interesting restriction on the linguistic levels they represented, but the
findings of the generative grammarians and the experimental phoneticians compel
[...] extensive areas in semantics, syntax, and speech perception which are part
of the speaker's competence in his native language. Yet, except for the purpose
of examples in the literature of linguistics and phonetics, one does not
encounter writing consisting of deep structure tree diagrams and
transformations, or, on the other [hand, ...] of articulatory patterns, narrow
phonetic transcriptions, [or] spectrographic patterns.3 Thus, it now appears
possible to make a significant generalization about writing systems. They
actually [...] conference, a relatively narrow linguistic stratum. Moreover,
this stratum does not include the level at which the listener perceives speech.
In short, writing tends to represent language at the morphemic,



3 [...] interesting exceptions to this generalization. The Hankul alphabet of
the Koreans (described by Martin in his paper for the [conference) and the
system of] Bell (1867) described by Dudley and Tarnoczy (1950) represent each
speech sound by a symbol depicting articulation, and Potter, Kopp, and Green
(1947) used a moving spectrographic display in a project to teach the deaf to
read speech sounds.



morphophonemic, or broad phonetic level, while speech represents language at
the acoustic level.

The differences which have been listed indicate that even though reading
and listening are both clearly linguistic and have an obvious similarity of
function, they are not really parallel processes. Instead, a rather different
account of the relationship of reading to language is proposed. This account
depends on a distinction between primary linguistic activity itself and the
speaker-hearer's awareness of this activity. Primary linguistic activity
consists of the processes of producing, perceiving, understanding, rehearsing, or
recalling speech. Many investigators have come to think that these processes
are essentially similar, since they all require the construction or
reconstruction of utterances in both phonetic and semantic form (Neisser, 1967).
As a cover term for all these processes, the term synthesis may be used.

Having synthesized some utterance, the speaker-hearer is conscious not only
of a semantic experience (understanding the utterance) and perhaps an acoustic
experience (hearing the speaker's voice) but also of experience with certain
intermediate linguistic processes. Not only has he synthesized a particular
utterance, but he is also aware of having done so and can reflect upon this
experience as he can upon his experiences with the external world.

If language were deliberately and consciously learned, this linguistic
awareness would hardly be surprising. One would suppose that development of
such awareness is needed to learn language, but language seems to be acquired
through maturation. Linguistic awareness seems quite remarkable when one
considers how little introspective awareness we have of the intermediate stages of
other forms of complex behavior, for example, walking or seeing. The
speaker-hearer's linguistic awareness is what gives linguistics its special
advantage over other forms of psychological investigation. Taking his informant's
awareness of particular utterances, not at face value but as a point of
departure, the linguist constructs a description of the informant's intuitive
competence in his language which would be unattainable by purely behavioristic
methods.

However, linguistic awareness is far from being evenly distributed over
all phases of linguistic activity. As Klima points out in his conference paper,
some stages of linguistic activity are more "accessible" than others. Much of
the process of synthesis takes place well beyond the range of immediate
awareness (Chomsky, 1965) and must be determined inferentially. The speaker-hearer
is unaware of the deep structure of utterances or of the processes of speech
perception. He is aware of phonetic events and easily detects deviations, and
this awareness can be increased with proper phonetic training. At the
morphophonemic level, reference to various structural units is possible. Words are
perhaps most obvious to the speaker-hearer, and morphemes hardly less so, at
least in highly inflected languages. Syllables, depending on their structural
role in the language, may be more obvious than morphophonemic segments. In the
absence of appropriate psycholinguistic data, any ordering of this sort must
be very tentative, and in any case it would be a mistake to overstate the
clarity of the speaker-hearer's awareness and the consistency with which it
corresponds to a particular linguistic level. But it seems safe to say that,
by virtue of this awareness, he has an internal image of the utterance, and
this image probably owes more to the morphophonemic representation than to any
other level.

[...] one class of [...] In such [...] of a rule relating to the morphophonemic
[...] artificially imposed upon production and perception. [...] additional
mental operation is required to perform [...] the process at a normal speaking
rate, one [...] encipherment rule but to have developed a certain [...] in
applying it. A second class of examples are the various [...] versification. The
versifier is skilled in synthesizing sentences [...] the language but also [...]
an additional set [...] relating to certain phonetic features (Halle, 1970).
[...] needs at least a passive form of this skill [...] scanning them [...]
awareness of the phonetic [...]
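
An encipherment rule of the kind discussed here can be sketched with Pig Latin, the secret language this paper refers to in connection with the Pig Latinist and with poor readers. The version below is one common form of the rule and is an assumption of this sketch: the initial consonant cluster moves to the end of the word and "ay" is added, while vowel-initial words simply take "way". It also operates on spelling, which only approximates a rule stated over sound segments.

```python
VOWELS = set("aeiou")

def pig_latin(word: str) -> str:
    """Encipher one word with a simple, spelling-based Pig Latin rule."""
    word = word.lower()
    # Find the position of the first vowel letter, if any.
    for i, ch in enumerate(word):
        if ch in VOWELS:
            break
    else:
        i = len(word)  # no vowel letter: treat the whole word as the onset
    if i == 0:
        return word + "way"           # vowel-initial word
    return word[i:] + word[:i] + "ay"  # move the onset, then add "ay"

for w in ["speech", "reading", "ear"]:
    print(w, "->", pig_latin(w))
```

Applying the rule requires segmenting the word at the boundary between its initial consonants and the first vowel, which is the extra operation imposed on top of ordinary production and perception.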

[...] language-based [...] linguistic activity. For one thing, there seems
[...] considerable individual variation in linguistic awareness: some speakers
are very [...] of linguistic patterns and exploit their awareness with obvious
[...] in verbal play (punning and charades) and verbal work (linguistic and
[...] research). Others seem never to be aware of much more than words [...]
when quite obvious linguistic patterns are pointed out to [them]. This variation
contrasts markedly with the relative uniformity [...] linguistic activity.
Moreover, if one were [...] with a system of versification, one might [...] what
the Pig Latinist or the versifier was up to, [...] them to be speaking an
unfamiliar language. [...] catches on to the trick, the sensation of engaging
[...] does not disappear; [...] linguistic awareness. In short, synthesis of
[...] primary linguistic activity is one thing; the awareness of this process of
synthesis is quite another.

The conclusion suggested here is [...] linguistic activity but a secondary
language-based skill [...] linguistic awareness. The [...] the reader is
determined not by the actual [...] information to be conveyed by the sentence
but by the [writer's ...] of synthesizing the sentence, an awareness [...]
impart to the reader. Since the reader has [...] and is familiar with the
conventions of [...], [he can] synthesize something approximating what the
writer intended, and so understand the sentence.

[...] and sentences are stored in phonetic form in short-term memory during the
mysterious process by which the understanding of utterances takes place.
Moreover, even though the writing system may be essentially morphophonemic,
linguistic awareness is in part phonetic. Thus a sentence which is phonetically
bizarre ("The rain in Spain falls mainly in the plain," for example) will be
spotted by the reader. Again, many of those who manage to read and write
ordinary text without "inner speech" or any signs of vocalization have to
mumble their way through numerical computations, though the numerals, unlike
alphabetic words, have no overt phonetic structure. Finally, Erickson et al.
(in press) have shown that in a test of recall from short-term memory, Japanese
subjects confuse kanji characters that are homophones, even though the kanji,
like numerals, have no overt phonetic structure.

In conclusion, the questions raised earlier in this paper can be
reconsidered. What is required for reading readiness? Apparently some degree of
linguistic awareness, in particular (for English, at least) awareness of
morphophonemic segments. Two of the conference papers directly support this view.
Shankweiler and I. Y. Liberman found that a group of poor readers could often
identify the first segment of a word like [...] but usually failed to segment
[...] correctly. Savin reported that his subjects, poor readers in
Philadelphia schools, could not master Pig Latin and shied away from any word
game involving segmentation, but they were happy enough in games where syllable
recognition was a sufficient skill. One begins to understand why the alphabet
was invented only once.

Are reading and listening parallel processes? Evidently not. Reading 
appears rather to be parasitical on spoken language, exploiting the reader's 
awareness of the contents of short-term memory. And finally, can the processes 
of reading and speech be represented on a single block diagram? Not very 
easily, because one of the boxes in a block diagram of reading must itself 
include the kind of partial knowledge of the block diagram of listening and 
speaking that has here been called linguistic awareness.

The goal of research on reading machines for the blind at Haskins
Laboratories is to produce by machine methods an output of clear, audible
English from an input of ordinary printed text. The core problem, generating
acceptable speech from phonetic spellings, seems very near a successful
solution through synthesis-by-rule methods. There is still much to be
done by way of evaluating and improving the synthetic speech, but the
research can now turn to some of the other problems involved in setting up a
complete Reading Service Center for the blind. During the six months covered
by this report, attention has been focused on two main endeavors: evaluation
studies of the reading machine output have continued with blind students, and
further progress has been made toward automation of the entire print-to-speech
generating process.

Evaluation by Blind Students 

Continuing the work reported in the previous issue of the Bulletin, two 
studies have been made of student reactions to hearing some of their regular 
textbook assignments in the medium of synthetic speech. For the first study,
with the help of faculty at the University of Connecticut, ten recorded pas- 
sages totaling 2-1/2 hours of listening time were administered to six blind 
students. These passages covered chapters in psychology and psychiatry as 
well as ancient and modern literature. The content fell broadly into two 
classes: either basically simple prose style or more elaborate composition 
demanding close analytical attention. 

Following these trial readings, the comments of the blind students showed
general agreement on five points. First, all the students found the speech
intelligible, and although an occasional word was missed, they had no trouble
in following the meaning of the simple prose; however, some students found
difficulty in concentrating on the subject matter of the more complex
material. Second, all students were favorably impressed by the stress and
intonation aspects of the speech. Third, all students complained about the
"cold-in-the-head" quality of the speech, but the samples used were too short to
determine whether the students would acclimate to this aspect of voice quality.
Fourth, all students thought that the speed of presentation of the samples
was too slow. [The rates ranged from 109 to 156 words per minute. The
latter is within the normal range of human speaking rates, but the long silences
(2 to 8 sec.) between some sentences in these early recordings made the
overall rate seem slow. These undue silences were eliminated in subsequent
recordings.] Finally, long and often unfamiliar polysyllabic words were
recognized easily. The words missed were usually monosyllables embedded
in sequences of other short words.



In a second study during the late fall of 1971, another series of tapes
was prepared at substantially higher word rates (164-221 wpm). These tapes
received the benefit of more recent refinements in the rules for producing 
and recording synthetic speech. The most detailed comments on the second 
tests were obtained from two female students who voiced opposing views that 
were, however, typical of the group as a whole. Both students noted the 
improvements in naturalness compared with the earlier tapes. The first 
student indicated that, for passages that were complex (in topic, grammar,
or vocabulary), she might well have preferred a slower rate. This student 
noted that her difficulty in focusing attention on the content (rather than 
on the voice quality) might disappear with longer experience in listening, 
but she was uncertain about how well she could use synthetic speech as a 
primary study tool. The second student, who listened to a text having a 
simple narrative style, was enthusiastic. She claimed to have missed only 
two words in a 15-minute recording that "spoke" at 221 wpm. She felt that 
she could make use of synthetic speech as a primary study tool. 

Plans for Further Testing and for a Reading Service Center 

More sophisticated tests are scheduled for the spring of 1972. In ad- 
dition, a faculty committee at the University of Connecticut is now actively 
planning further steps toward the development of a Reading Service Center 
which will be located on the campus and will utilize the Haskins Laboratories 
speech synthesis facilities. These plans call for a two-part program com- 
mencing with a 12- to 18-month study of the human, economic, and technical 
factors involved in the operation of such a Reading Service Center. Haskins 
Laboratories will be involved in this study as a supplier of synthetic speech
material to the University, using the automated facility currently being 
developed. The University researchers will be responsible for distributing 
the tape recordings and conducting sequential listening tests, both with 
blind students (some of them veterans) enrolled at the University and with 
blind students in schools and colleges throughout Connecticut. The second 
part of the program will incorporate the data from these studies to make 
decisions on the size and type of computer and optical character recognition 
equipment required for an on-campus Reading Service Center and to seek fund- 
ing for its implementation during the 1973-74 academic year. 

Automating Text Preparation 

At the Laboratories, the task of automating the production of synthetic
speech from an input of printed text continues. Enquiries are in progress
toward the acquisition (on lease) of a limited-font optical character
recognizer. Needed is a suitable machine for converting text that has been typed
in OCR-A or [...] (upper and lower case) type face into an alphanumeric code on
magnetic tape. (The choice of a "simple" OCR machine and human typists for
the initial production phase is based primarily on cost considerations.
Multifont machines to read book pages directly are available and their
higher cost will be justified when a higher level of text production is
wanted.) Optical character recognition represents the first of six stages
involved in the production of a synthetic speech recording. These stages
are shown in Figure 1.

Following the recognition stage, the words of the text are converted 
into phonemic form by means of a dictionary look-up (Stage 2). This dic- 
tionary now contains about 150,000 entries which are distributed in three 
compartments, with room for several-fold expansion. The first compartment 
contains a few hundred of the most frequently used words such as "the," 
"of," etc. In the second compartment is stored the overwhelming bulk of 
all entries. To facilitate access, this main store has been divided into 
functional "pages," which are referenced from a page-size table of contents. 
Locating an entry in the main store entails a two-part search, first through 
the table of contents, then through a page. The third compartment contains 
all oversize words (length greater than sixteen letters). 

Each word entry, in both the high-frequency and main stores, contains 
the orthographic spelling, the phonetic respelling, and an indication of 
the word's usual grammatical functions. The initial version of the main 
store has now been completed, and programs for searching it are being written. 
These programs allow for editorial intervention to introduce new words that 
are not now available in the dictionary, as well as to correct errors. 
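
The three-compartment organization and the two-part search of the main store can be sketched roughly as follows. This is only an illustration under stated assumptions: the class and field names, the tiny page size, the use of binary search on the table of contents, and the respelling notation are all hypothetical and are not taken from the Laboratories' programs.

```python
from bisect import bisect_right
from dataclasses import dataclass

OVERSIZE_LENGTH = 16  # words longer than this go to the third compartment

@dataclass(frozen=True)
class Entry:
    spelling: str     # orthographic spelling
    respelling: str   # phonetic respelling (placeholder notation here)
    functions: tuple  # usual grammatical functions

class ThreePartDictionary:
    def __init__(self, entries, high_frequency_words, page_size=2):
        common = set(high_frequency_words)
        self.high_frequency, self.oversize, main = {}, {}, []
        for entry in entries:
            if entry.spelling in common:
                self.high_frequency[entry.spelling] = entry
            elif len(entry.spelling) > OVERSIZE_LENGTH:
                self.oversize[entry.spelling] = entry
            else:
                main.append(entry)
        main.sort(key=lambda e: e.spelling)
        # Main store: fixed-size pages plus a table of contents holding
        # the first spelling on each page.
        self.pages = [main[i:i + page_size]
                      for i in range(0, len(main), page_size)]
        self.contents = [page[0].spelling for page in self.pages]

    def lookup(self, word):
        if word in self.high_frequency:
            return self.high_frequency[word]
        if len(word) > OVERSIZE_LENGTH:
            return self.oversize.get(word)
        # Part 1: the table of contents selects the last page whose first
        # spelling is not greater than the word.
        page_index = bisect_right(self.contents, word) - 1
        if page_index < 0:
            return None
        # Part 2: search within that single page.
        for entry in self.pages[page_index]:
            if entry.spelling == word:
                return entry
        return None

entries = [
    Entry("the", "dhuh", ("article",)),
    Entry("of", "uhv", ("preposition",)),
    Entry("blind", "blynd", ("adjective", "noun")),
    Entry("machine", "muh-sheen", ("noun",)),
    Entry("reading", "ree-ding", ("noun", "verb")),
    Entry("speech", "speech", ("noun",)),
    Entry("electroencephalographic", "(omitted)", ("adjective",)),
]
dictionary = ThreePartDictionary(entries, high_frequency_words=["the", "of"])
print(dictionary.lookup("speech"))
print(dictionary.lookup("zebra"))  # not in the dictionary
```

A word that is not found returns nothing, which is the point at which the editorial intervention described above would add a new entry.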

Stress and Intonation 

In the third stage, the phonemic string generated by the dictionary 
search is processed to introduce the stress and intonation features required 
to guide the synthesis program. Each dictionary word is (by the rules we 
are using) a member of one of five main stress classes: Low Stable, Low 
Unstable, Mid Unstable, High Stable, High Unstable. (Words with unstable 
stress shift their stress grade in specified contexts.) In general, low- 
stressed words are the so-called function words of speech (articles, pre- 
positions, auxiliary verbs, many pronouns, connectives); words with mid 
stress are modifiers and verbs in the past tense (and past participles); 
high-stressed words are nouns (or multi-use words that can be nouns), words 
of four or more syllables, numerals, certain emphatic words, comparative 
and superlative forms of adjectives, and a small number of semantically 
special words that tend to receive full stress in normal speech. 
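
A minimal sketch of how a stress level might be read off a dictionary entry is given below. The text gives only the general tendencies for low, mid, and high stress; the category labels, the order in which the tests are applied, and the treatment of all pronouns as low-stressed are assumptions made here, and the rules that separate stable from unstable classes are not stated in the text and are not modeled.

```python
from enum import Enum

class StressClass(Enum):
    LOW_STABLE = "Low Stable"
    LOW_UNSTABLE = "Low Unstable"
    MID_UNSTABLE = "Mid Unstable"
    HIGH_STABLE = "High Stable"
    HIGH_UNSTABLE = "High Unstable"

# Hypothetical labels for a word's usual grammatical functions.
LOW_FUNCTIONS = {"article", "preposition", "auxiliary verb",
                 "pronoun", "connective"}
MID_FUNCTIONS = {"modifier", "past-tense verb", "past participle"}
HIGH_FUNCTIONS = {"noun", "numeral", "emphatic",
                  "comparative", "superlative"}

def stress_level(functions, syllables):
    """Return "high", "mid", or "low", or None if the text gives no rule."""
    functions = set(functions)
    # Assumed precedence: any high-stress property wins, so a multi-use
    # word that can be a noun is treated as high-stressed.
    if functions & HIGH_FUNCTIONS or syllables >= 4:
        return "high"
    if functions & LOW_FUNCTIONS:
        return "low"
    if functions & MID_FUNCTIONS:
        return "mid"
    return None

def classes_at(level):
    """List the stress classes named in the text for a given level."""
    return [c for c in StressClass if c.value.lower().startswith(level)]

print(stress_level({"article"}, 1), classes_at("low"))
print(stress_level({"noun", "modifier"}, 2), classes_at("high"))
```

Note that the five classes include no stable mid class, so every mid-stressed word can shift its grade in context.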

In the fourth stage, the phonetic strings from Stage 2 and the stress 
and intonation assignments from Stage 3 are combined into a series of 
syllable-generating digital instructions by the computer program. These 
instructions are realized as a synthetic speech wave form by the synthesizer 
(Stage 5) which is recorded as a series of audible sentences in the final 
stage. 

Recent work has centered on adjusting the specifications for the basic
American English sounds (the phonemes) for better compatibility at fast word
rates (above 150 wpm), modifying the speech program to provide pauses of
various lengths, and refining the stress assignment rules for complex texts.

[...] delivered after long delays, sometimes up to several months. Ultimately, we
believe, the solution lies in the establishment of a national network of Reading
Service Centers utilizing reading machines capable of generating synthetic
speech at a rate many times faster than natural speech (a rate of twenty to
thirty times faster would be practical) and of recording the output on tapes
moving at a proportionally fast rate. The speech could then be replayed by the
listener at a normal speed. Braille could be provided as well when desired.
These centers could be based in large regional libraries and would rapidly
provide taped synthetic speech in response to requests made by mail, telephone, or
in person. The service they would provide could not fail to have significant
economic and social value by permitting a far larger segment of the blind
population to contribute their skills to society.

The goal of a national network is, of course, far reaching and, in
comparison with current expenditure on reading services for the blind, is likely
to be considered expensive. But the problems to be faced in establishing such a
network are not only economic. As stated elsewhere (Nye and Bliss, 1970), there
are often many other difficulties to be met in our society in establishing an
effective interface between a technical capacity and its potential field of
application, and these are exemplified in the field of sensory aids for the
blind. In fact, most of the difficulties are acutely visible in the whole
bioengineering field (Task Group, 1971).

The research work on the development and improvement of synthetic speech
has been in progress for a number of years. Further progress can be expected. 
Nevertheless, we believe that the point has now been reached when it is necessary 
to evaluate our progress and to determine whether the speech is good enough to 
apply in its present form. Our reasons are the following. First, synthetic 
speech, although not yet perfectly natural, has been developed to the point 
where it is intelligible to people who have received no prior exposure to syn- 
thetic speech or training in its use. Moreover, this is true of synthetic 
speech delivered at rates in excess of 150 wpm. No other reading machine output
intended for use by the blind can make such a claim. It can be argued,
therefore, that the value of synthetic speech has already been established and the
question of how it may be deployed to provide a useful service can now be given 
serious consideration. Second, there is an immediate need in the blind commun- 
ity (particularly among students) for an increase in the supply and speed of 
delivery of spoken text. A reading machine is ideally suited to the task of 
producing large volumes of material quickly and can already start to fill the 
gap in present services by supplementing the material produced by human readers. 
Third, although synthetic speech appears at present to be at an economic
disadvantage when compared with naturally produced speech, the costs of operating
reading machines can be expected to fall in the future, whereas human labor
costs will certainly increase. The eventual widespread use of reading machines
is therefore inevitable. This conclusion leads to our fourth point, which is
that the initial entry of automated techniques into any new arena can always be
expected to be met by new and often unforeseen problems. Such problems are
usually amenable to solution. However, they first need to be identified and 
then time must be allocated to find ways of circumventing each difficulty. We 
believe this to be true for reading machines and that, in the interest of com- 
prehensiveness, we cannot afford to delay any longer the task of evaluating 
synthetic speech with blind people under field trial conditions.

Thus the University of Connecticut and Haskins Laboratories are collaborating
in the development of an evaluation program leading to the construction
of a Reading Service Center on the University campus, initially to serve
the blind students enrolled there, but with the eventual goal of extending the
service to other blind people statewide. The pilot center will be a first step 
toward setting up similar centers elsewhere. The evaluation and development 
work that we are undertaking builds upon the research carried out at Haskins 
Laboratories under Veterans Administration support. Initial reading tests with 
synthetic speech texts have already been conducted on blinded veterans, as well 
as with blind students, with encouraging results. 

THE TEXT-TO-SYNTHETIC-SPEECH PROCESS 

The speech synthesis system which we will use was constructed at Haskins 
Laboratories, both as a research tool for studies on the perception of speech 
and as a step toward the development of a reading machine for the blind. 
Figure 1 shows the sequence of steps involved in text-to-speech conversion. 

The characters, which are recognized by the optical reader, are grouped into words
and recoded into phonemic form by means of an automatic dictionary. The phonemic
text is punctuated with stress and intonation assignments and then transformed
by another program into instructions for the control of a terminal analog speech
synthesizer. Synthetic speech output from the synthesizer is then recorded on
tape for use by the blind reader. A substantial part of this system (the speech
synthesis procedure which embraces the last three steps of Figure 1) is already
fully operational. Input to this completed portion of the system (by way of a 
phonetic keyboard) at present requires considerable hand labor. This work will 
be avoided when the first three stages, which are currently under development, 
are made operational. 
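
The stage ordering described above can be illustrated with a minimal Python sketch. It is not the Haskins software: every function name, the two-word toy dictionary, and the "/MID" stress mark are hypothetical placeholders, and the hardware synthesizer and tape-recording stages are left out.

```python
from typing import List

# Toy stand-in for the automatic dictionary (entries are invented).
TOY_DICTIONARY = {"the": "DH AH", "cat": "K AE T"}

def recognize(page: str) -> List[str]:
    """Stage 1: optical reader; here the 'page' is already a string."""
    return page.lower().split()

def to_phonemes(words: List[str]) -> List[str]:
    """Stage 2: dictionary lookup; unknown words are flagged, not guessed."""
    return [TOY_DICTIONARY.get(w, f"<?{w}>") for w in words]

def mark_stress(phonemic: List[str]) -> List[str]:
    """Stage 3: placeholder that tags every word with a dummy stress mark."""
    return [p + " /MID" for p in phonemic]

def to_instructions(marked: List[str]) -> List[str]:
    """Stage 4: placeholder for the synthesis-by-rule program."""
    return [f"SYNTH[{m}]" for m in marked]

STAGES = [recognize, to_phonemes, mark_stress, to_instructions]

def run(page, stages=STAGES):
    """Feed the output of each stage into the next, in order."""
    data = page
    for stage in stages:
        data = stage(data)
    return data
```

Keeping each stage as a separate function mirrors the point made in the text: the last stages can run on hand-typed phonemic input today, and the earlier stages can be swapped in once they become operational.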

Synthetic speech is currently being produced at Haskins Laboratories by a 
Honeywell DDP-224 computer which controls a hardware synthesizer designed by 
Cooper. To make the machine speak, a phonetically trained typist must translit- 
erate the printed text into a phonemic text and type it on a keyboard attached 
to the computer. Stress and intonation markers are assigned by programmed rules
to punctuate the phonemic text, as described by Gaitenby, Sholes, and Kuhn 
(in press). The typed phonemic symbols and punctuation are then displayed on a 
storage oscilloscope which allows the operator to examine the input to the com- 
puter and to correct typographical errors if necessary. Using this phonemic 
input, the computer calculates values for the dynamically controlled parameters 
of the synthesizer on the basis of programmed rules devised by Mattingly (1968). 
These values are then fed to the synthesizer at a rate set by the operator. In
practice, speech can be generated at rates from 60 words per minute (wpm) to
over 300 wpm. However, a passage of speech lasting for ten minutes at a normal
presentation rate may well take the phonetic typist at least one hour to prepare.
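
To make the bottleneck concrete, the figures quoted above can be turned into a preparation ratio. This is only a back-of-envelope sketch: the 150 wpm presentation rate is an assumption made for illustration (the text says only "normal"), and the function names are invented.

```python
def preparation_ratio(speech_minutes: float, typing_minutes: float) -> float:
    """Minutes of typist time spent per minute of finished speech."""
    return typing_minutes / speech_minutes

def typist_words_per_minute(speech_minutes: float,
                            typing_minutes: float,
                            speech_wpm: float) -> float:
    """Effective typing throughput implied by the two durations."""
    return speech_minutes * speech_wpm / typing_minutes

# Ten minutes of speech, at least one hour of preparation (from the text).
ratio = preparation_ratio(10, 60)
# 150 wpm is an assumed presentation rate, not a figure from the text.
rate = typist_words_per_minute(10, 60, 150)
print(ratio, rate)
```

Under these assumptions the typist spends at least six minutes per minute of output, which is the labor the three automated steps are meant to remove.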

The way we propose to avoid the excessive labor and delay involves the
addition of three major component steps which will automate to a large degree 
the tasks now performed by the phonetic typist and should greatly speed the 
process of transliteration. These steps will enable us to generate the relatively 
large volumes of reading material required to provide a reading service. 

The first step employs an Optical Character Recognition (OCR) machine. 
Primarily because the size of our evaluation study does not merit the use of a 
multi-font OCR capable of reading proportionally spaced ink print, we plan to 
have text material retyped in an OCR-A upper- and lower-case type face and read
by one of the smaller limited-font machines. The output will be recorded on
digital magnetic tape for subsequent use by the computer. There are several
other reasons why we prefer to use even a limited-capacity optical reader for
input rather than an on-line typewriter, punched paper tape, or punched cards.
The first is that much larger volumes of reading matter than we have used
before must be fed into the computer as rapidly as possible. A good typist
can typically work on straightforward text more rapidly and accurately than
can a keypunch operator. Moreover, if the work is performed off-line, it need
not occupy the computer during the text production process. Once a large
volume of typescript has been prepared, an OCR reader can convert it into an
alpha-numeric code expeditiously and cheaply. A second reason for our interest
in OCR input lies in the opportunity it provides to obtain some introductory
experience of current OCR technology so that we can better judge which machines
and techniques best meet the needs of blind people and what problems still
require solution. It is already apparent that the specifications of OCR devices
designed for commercial applications do not fully satisfy the requirements of
reading machines for the blind. For example, almost all of the commercial
multi-font OCR development is geared toward high-speed and high-accuracy operation
on an input medium for which some or all of the following features are closely
specified: size and shape of the page, color, type styles, print quality, and 
the position of the printed text within the page. In contrast, a reading machine 
for the blind must be flexible with respect to each of these input specifica- 
tions. It is possible that by deliberate design this flexibility could be 
gained at the expense of recognition accuracy which, for a reading machine 
application, may be a little less stringent than that required for business 
purposes. However, because of the limited market potential of the blind commun- 
ity it is unlikely that the commercial sector will show an interest in solving 
their special problems in the near future. The solutions must be sought by 
those who have a direct interest in the eventual development of automated reading
services. 

The second step required to speed our production of reading materials is 
the addition of an automated dictionary for converting the alphabetic representa- 
tion of each word into its corresponding phonemic representation. In the compil- 
ation of our now-completed dictionary we are greatly indebted to the work of 
Dr. June Shoup of the Speech Communications Research Laboratory. This dictionary
will initially include approximately 150,000 words but may have to be expanded. 
Optimum dictionary size can best be determined through actual use in a practical 
production system. Throughout this process the system's performance will have 
to be carefully watched to insure a proper balance between the size of the 
dictionary and search time or production rate. 

The third step entails automated stress and intonation marking. Following 
its assembly by the dictionary search routine, the phonemic string will be punc- 
tuated with stress and intonation symbols by a program based on the rules of 
Gaitenby, Sholes, and Kuhn (in press). In the final system the output from the 
dictionary and the stress assignment routines will be displayed on a storage 
oscilloscope and monitored by an editor-phonetician. Corrections of errors will 
be made by the editor, who will also note the circumstances in which errors 
occur. By this procedure we expect to produce useful text and at the same time 
to detect defects and omissions in the system and readily correct them. 

When the combination of phoneme string and stress markings has been formed,
it will be processed by the speech synthesis program that is already in operation,
converted to synthetic speech, and recorded.

EVALUATION STUDIES



The completed production system will generate synthetic speech from text 
in sufficient quantity to meet the needs of an evaluation study. The purpose 
of the trial is to determine whether the operation of a Reading Service Center 
is economically feasible. The question of economic feasibility is not easy to
answer because of the intangible human values involved, but it is obvious that 
we need to identify costs and benefits as accurately as possible and to assess 
them in relation to available resources. To find the answers we need, we pro- 
pose to operate the production system we have just described, to perform some 
analytical tests, and to transcribe into synthetic speech some actual reading 
assignments required by blind students at the University. These assignments 
will be supplied in a manner similar to that in which existing services operate 
at the University to provide recordings of natural speech. By examining the 
way in which these materials are used, we expect to find the answers to two 
broad classes of questions. The first relates to human factors, the second to 
technical and economic factors. Figure 2 illustrates the areas to be explored. 

In the area of human factors, we are concerned with the relative comprehensibility
of synthetic speech and natural speech over a range of speaking rates.
The key question here is whether any differences in comprehensibility that may 
emerge are significant enough to affect the educational utility of synthetic 
speech. Analytical testing procedures will be used. The basic strategy for 
assessing comprehensibility involves the presentation of a lively passage of 
general interest followed by a series of questions which seek measures of the 
number of facts retained (i.e., names, places, distances, colors, etc.) and also 
the ability of the reader to derive logical inferences from the information. We 
propose to apply such tests using synthetic speech and natural speech controls 
(with appropriate counterbalancing) and then to compare the performance of the 
students. In a series of interviews designed to assess acceptability, we plan 
to gather data on such subjective factors as the relative preference for synthe- 
tic speech versus natural speech, the comparative comfort in use of the media, 
judgments regarding the aptness of different media for various fields of study, 
and the influence of delivery rate on all of these factors. 
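
One simple way to arrange the counterbalancing mentioned above is sketched here in Python. The design is an assumption for illustration, not the procedure actually adopted: two passages, two media, and listeners alternating as to which medium carries the first passage. The listener labels and passage names are hypothetical.

```python
MEDIA = ("synthetic", "natural")

def counterbalance(listeners, passages):
    """Assign each listener one medium per passage, alternating which
    medium is paired with the first passage from listener to listener."""
    if len(passages) != len(MEDIA):
        raise ValueError("this sketch handles exactly two passages")
    plan = {}
    for index, listener in enumerate(listeners):
        shift = index % len(MEDIA)
        rotated = MEDIA[shift:] + MEDIA[:shift]
        plan[listener] = list(zip(passages, rotated))
    return plan

plan = counterbalance(["S1", "S2", "S3", "S4"], ["passage A", "passage B"])
for listener, schedule in plan.items():
    print(listener, schedule)
```

With an even number of listeners, each passage is heard equally often in each medium, so a difference in passage difficulty does not masquerade as a difference between synthetic and natural speech.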

In the area of technical factors, we are concerned with establishing an 
accurate assessment of the overall demand that a Service Center will be required 
to meet, the technical quality of the synthetic speech medium required to pro- 
duce acceptable performance at reasonable cost, the turn-around time which is 
both acceptable and economic, and the range of speaking rate required of the out- 
put. From these data an optimum equipment configuration can be determined and 
labor and operating costs can be estimated. 



CONCLUSION 



In this paper, we have argued that in order to provide better educational, 
vocational, and recreational opportunities for the blind population of this 
country, faster and more flexible reading services are required. Moreover, the 
technical resources are now available to supplement existing services through 
the use of reading machines located in Reading Service Centers. We believe that
the time is now ripe to make a determined effort to move this technical capability 
out of the laboratory and into the community it could serve. The preliminary 
evaluative work we have described here is extensive and time consuming. Neverthe- 
less, it is essential that an exploration of the extent to which the results of 
our research meet the needs of blind people be carried on in parallel with
continued research. This strategy must be followed if we are to apply
effectively a laboratory-developed technology to a socio-economic problem as
complex as blindness.



Word and Phrase Stress by Rule for a Reading Machine

Jane H. Gaitenby, George N. Sholes, and Gary M. Kuhn^ 
Haskins Laboratories, New Haven 



ABSTRACT 

A blind listener, using a reading machine that produces synthetic 
English speech, must receive an auditory version of a printed text 
that is intelligible, reasonably natural, and as fast as he likes. 
Good, fast, synthetic speech by rule (SSBR) has existed for some time; 
but stress and intonation tag assignment, being dependent partly upon 
the specific synthesis program used, has evolved slowly. The present 
stress assignment rules (written for Mattingly's SSBR) require a large
stored dictionary, but the context rules are fairly simple and work
quite passably in the majority of English prose constructions. These
rules are explained briefly (and, at the conference, were demonstrated
in synthetic speech).

The preceding paper by Dr. Nye and colleagues dealt with a plan for evaluating
the synthetic speech output of a reading machine for the blind, namely the
machine designed and developed at Haskins Laboratories. Our purpose here is
dual: first, to explain some of the technicalities of providing the machine
with the capability of stressing its words and phrases in a manner appropriate
to General American English and, second, to demonstrate the particular approach 
currently in use. 

By way of background, a short comparison of the reading process as done by
humans and as done by machine seems necessary. It is obvious that the goal in 
reading machine research on synthetic speech outputs is to produce clear English 
that is as acceptable to a blind listener as a reading made by a human would be. 
How does a human read aloud? No one knows exactly how this is done, but an
attempt at a quick description follows. 

The good human reader comes fore-armed to the reading task with years of 
experience in conversational English (listening and speaking). A graphic word 
is equivalent to a familiar spoken word to him, and the word has one or another 
familiar meaning depending on its context. The reader is accustomed to the 
normal stress patterns of English, and to phrase and sentence grammatical and 
intonational structure. Since the written word is merely a symbolic substitute



This is a revised version of a paper presented at the 1972 Conference on Speech 
Communication and Processing, Newton, Mass., 24-26 April 1972. It appeared as 
Paper A4 in the Conference Record.

^Also University of Connecticut, Storrs.

for the spoken word, using the same vocabulary and syntactic rules, the human,
when reading aloud, utters the spoken words indicated in the text in the same 
serial order as they appear in print. As he reads aloud, he continuously scans 
the print ahead by eye, processes the visual information into chunks of meaning, 
and uses stress and intonation that are appropriate to the word order, to the
punctuation, and to the vocabulary content. 

A requirement of an electronic apparatus that reads text aloud is that it
produce verbal results similar to the speech of a good human reader. Where
possible, the reading machine has been equipped with parallel human faculties.
In place of eyes, an optical character recognizer will scan the words
of the printed page in serial order. Substituting for the human's speaking
and listening experience, the large dictionary stored in the computer memory
will match the scanned incoming words with their usual phonetic equivalents
(in machine code), and the stress patterns/levels of words in probable stress
contexts will be assigned by rule (to be described below). From a table of
American English phonemes and a program that combines the phonemes into syllables,
the machine will digitally manufacture acoustic specifications for the
words of the text. Finally, the machine will provide intonation, by rule, to
the sentences it generates in synthetic speech (taking assigned stress into
account).

The machine is capable of converting print to speech rapidly, for hours or
days at a time, without voice fatigue. And the machine's output words can be 
produced at any one of a range of rates—depending on the blind user's preferred 
listening rate. 

But the machine cannot think. In contrast to the skilled human reader, who
consults the deep structure of the language (that is, the meaning) as well as
the surface structure (the phonetics suggested by the spelling, word order,
punctuation, etc.) as he reads, the machine operates only on the surface
structure level. The machine is not equipped to make stress or intonational adjust-
ments that are signalled by overall meaning. But the machine can be programmed 
to treat categories of words in special ways, and to modify stress in certain 
positional contexts. This being so, stress assignment by machine depends on 
classification of the English vocabulary by predictable or probable stress 
patterns. (That which is unpredictable is, ipso facto, of no utility to the
machine.) 

Basic to English phrasal stress are the inherent (lexical) stress pattern^1
of each polysyllabic word and the inherent stress level of each monosyllabic word
(as a phrasal constituent).^2 The stress relationships of syllables within a



^1 The relative levels of sequential syllables within a polysyllabic word
constitute its stress pattern.

^2 A monosyllabic word has, of course, no internal syllabic stress contrasts
(stress pattern). The stress level of a monosyllabic word has been inferred
(and assigned) on the basis of its normal (probable) level within a normal
phrase. The normal phrase can be construed as resembling a polysyllabic word:
the phrase tends to have a stress pattern. The phrase, however, is made up of
free morpheme units and is consequently rather less stable in its stress pattern
than the polysyllabic word.

given polysyllabic word are ordinarily maintained in any sentence location and
syntactic circumstance. English shows favorite word stress patterns: HIGH-LOW
for two-syllable words, as in "cattle"; HIGH-LOW-MID for three syllables, as in
"catalogue." Stress patterns for longer words are usually predictable if
stress-stable suffixes such as "-ation" are present. However, there are numerous
exceptions to the common English stress patterns (due largely to abnormal stress
in some compound words and words borrowed from Romance languages). This fact 
has made it worthwhile to store an entire English dictionary, with each lexical 
stress indicated, in the memory of the machine. 

The stored dictionary contains a phonetic word in digital form (along with 
the lexical stress) to match each text word recognized by the optical scanner. 
(Rare proper names, recent coinages, and other words not contained in the diction- 
ary will be generated by letter-to-sound rules.) 
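
The lookup-with-fallback arrangement just described might be sketched as follows. This is an illustration only: the phoneme strings, the tiny letter table, and the function names are invented, and real letter-to-sound rules would be far more elaborate. Only the HIGH-LOW and HIGH-LOW-MID patterns for "cattle" and "catalogue" come from the text.

```python
# Invented entries: (phoneme string, stress grade per syllable).
LEXICON = {
    "cattle": ("K AE T AH L", ("HIGH", "LOW")),
    "catalogue": ("K AE T AH L AO G", ("HIGH", "LOW", "MID")),
}

# Deliberately crude one-letter-one-sound table, for illustration only.
LETTER_SOUNDS = {"a": "AE", "b": "B", "k": "K", "o": "AO", "z": "Z"}

def letter_to_sound(word):
    """Fallback: spell the word out sound by sound; no stress is known."""
    sounds = [LETTER_SOUNDS.get(ch, "?") for ch in word]
    return " ".join(sounds), None

def lookup(word):
    """Return (phonemes, stress, source) for a recognized text word."""
    key = word.lower()
    if key in LEXICON:
        phonemes, stress = LEXICON[key]
        return phonemes, stress, "dictionary"
    phonemes, stress = letter_to_sound(key)
    return phonemes, stress, "rules"
```

Returning the source alongside the phonemes lets an editor see at a glance which words bypassed the dictionary and may need checking.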

Before discussing the actual stress rules, Ignatius Mattingly's Speech 
Synthesis by Rule Program, which the word and phrase stress rules operate upon 
and within, will be briefly described. In order to illustrate the program for 
synthesis in the simplest way, it will be described in synthetic speech itself, 
with its stress, intonation, and phonetics generated entirely by rule. The 
textual input to the machine used in generating the demonstration tape was typed 
on a phonetic typewriter (simulating the print-to-phonetics conversion in the 
dictionary, as well as the automatic implementation of the stress rules). The 
typed phonetic data went directly to the computer where the synthesis program 
determined the computation of the acoustic features for the sentences. The 
computed material was then synthesized by a parallel formant generator. 

[SYNTHETIC SPEECH DEMONSTRATION 1] "This is the voice of the synthe- 
sizer at Haskins Labs. There are two main parts of Mattingly's 
Speech Synthesis Program. The first part consists of a table of 
standard American English phonemes, and the second part consists of 
digital instructions for combining the phonemes into syllables with 
reasonable intonation. There are four grades of stress possible in 
the program, of which three are being used in this demonstration. 
Mattingly's Rules compute the intonation for each breath group on
the basis of the punctuation given in the printed text input, and on 
the basis of the stresses assigned." 
(At the meeting it was then explained that the Mattingly program includes a choice 
of three intonational contours: the Fall, Fall-Rise, and the Rise.)

[SYNTHETIC SPEECH DEMONSTRATION 2] (This consisted of words "yes" and 
"no" played in each of the three intonational contours, at three word 
rates.) 



To turn to the stress rules: as mentioned above, the stress assignment procedure
takes as its point of departure a dictionary that includes inherent stress
along with the phonetic word, in machine language. (That is no small endowment!)
The dictionary words are also tagged as members of stress categories that are 
compatible with the synthetic speech program. 

Three grades of stress are being used. Stress III (LOW), the unstressed 
grade, is realized in synthesis as low in pitch (F0) and short in duration.
Stress II (MID, or secondary stress) has longer duration than Stress III.
Stress I (HIGH, or primary stress) has the same relative duration as MID stress,
but the fundamental frequency of a HIGH syllable is raised above the basic 
intonational contour. (Intensity is not manipulated in the SSBR program to 
indicate stress except in stop consonants. Intensity does change as a function 
of the spectral properties intrinsic to individual phones, however. In the 
case of vowels, in particular, this spectral property is in itself a strong cue 
to stress.) 
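
The three grades differ on just two cues, which the small Python table below makes explicit. It is a sketch of the description above, not of the SSBR program itself: the class and field names are hypothetical, and the cues are recorded only as yes/no flags because no numerical values are given.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Realization:
    lengthened: bool   # longer duration than the unstressed grade
    f0_raised: bool    # pitch raised above the basic intonational contour

GRADES = {
    "LOW":  Realization(lengthened=False, f0_raised=False),  # Stress III
    "MID":  Realization(lengthened=True,  f0_raised=False),  # Stress II
    "HIGH": Realization(lengthened=True,  f0_raised=True),   # Stress I
}

def cues_distinguishing(a, b):
    """Names of the cues on which two stress grades differ."""
    ra, rb = GRADES[a], GRADES[b]
    return [name for name in ("lengthened", "f0_raised")
            if getattr(ra, name) != getattr(rb, name)]
```

Read this way, duration alone separates LOW from MID, and fundamental frequency alone separates MID from HIGH.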



To group words into what has turned out to be five major word categories 
(and several subcategories) two features have been ascribed to each word: 
stress stability and probable stress level (stress level of the lexically 
stressed syllable, if there is one). Words that are classed as stable maintain 
their assigned stress in all contexts; unstable words alter their dictionary 
stress level in specified contexts, in specified ways. 

Stress Group A consists of many of the so-called function words, mostly
monosyllables (for the most part, these are personal pronouns, auxiliary verbs,
and conjunctions). These words are usually unstressed in speech. They are 
among the 100 most frequent English words in texts and are of major importance 
to the rhythm and intonation of speech, even when they are barely audible. 
Words belonging to Group A are classed as unstable in stress. They are unstressed 
(LOW), except in pre-pause position, where they become MID. There are exceptions
to this rule that apply to some sequences of function words such as two successive
prepositions (in which case the first preposition receives more stress than
the second). Prepositions and similar words are accordingly placed in a sub- 
category. Other exceptions are words like "the," "of," "and," "as," and "him." 
The phonetic shape of these words is dictated by immediate context, positional 
(e.g., initial in breath group) in some cases, and phonetic (e.g., followed by 
a consonant) in others. 

[SYNTHETIC SPEECH DEMONSTRATION 3] (Two sentences exemplifying 
stress Group A were played.) 



Stress Group B contains only a few words: some pronouns in the objective
or possessive case, such as "me," "my," "their," "us," and several contractions
("I'll," "it's," etc.). These are words that are rarely final, or else very seldom
stressed unless italicized. This group is classed as stable in stress and LOW.

Group D contains a mixed bag of words from the point of view of gramma- 
tical role. (Group C will come later.) D is characterized as a stable stress 
group and HIGH. This means that the lexically stressed syllable in a D word 
receives HIGH stress in all locations. The words comprising this group are: 
all words of four or more syllables (a primary stress seems to be a requirement 
in long words); all spelled-out numerals (but the word "one" is a special case); 
all words of two or more syllables ending in "-ing" and "-ings" (except for the 
auxiliary verbs of Group A); polysyllables ending in "-ion" or "-ions," ending 
in "-al" or "-als," or ending in "-ic" or "-ics"; and a list of specific words
(seventy-five such words at last count) most of which deal semantically with
limit or extent, e.g., "both," "else," "fully," "rare," "similar," "single,"
"entire." Also in Group D are all comparative and superlative adjectives. (A
special feature of this group is that it causes an adjacent single noun to the 
right to lower its stress. This feature is actually stated as a noun stress 
shift rule.) 
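
The membership criteria for Group D lend themselves to a simple test, sketched below in Python. Several things here are assumptions: the syllable count and the comparative/superlative flag are taken to come from the stored dictionary, the numeral and special-word sets are partial, and "being" and "having" merely stand in for the Group A auxiliary "-ing" forms. The word "one" is left out of the numerals because the text treats it as a special case.

```python
NUMERALS = {"two", "three", "four", "five", "ten", "twenty"}   # partial list
SPECIAL = {"both", "else", "fully", "rare", "similar", "single", "entire"}
GROUP_A_ING = {"being", "having"}   # assumed auxiliary "-ing" forms
SUFFIXES = ("ing", "ings", "ion", "ions", "al", "als", "ic", "ics")

def in_group_d(word, syllables, comparative_or_superlative=False):
    """True if the word meets any of the Group D criteria listed above."""
    w = word.lower()
    if w in GROUP_A_ING:
        return False            # auxiliaries stay in Group A
    if syllables >= 4 or w in NUMERALS or w in SPECIAL:
        return True
    if comparative_or_superlative:
        return True
    # Suffix criteria apply only to words of two or more syllables.
    return syllables >= 2 and w.endswith(SUFFIXES)
```

For example, `in_group_d("reading", 2)` holds through the "-ing" criterion, while the monosyllable "ring" is rejected despite its spelling.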
There are two more stress groups. Group E consists of all nouns not contained
in Group D. Root (or present-tense) forms of verbs that also may function
as nouns are classed as nouns alone, since nouns as a class occur more
frequently than do verbs. This group is unstable in stress. It receives MID 
stress when preceded by a Group D word as in "four books," "better idea," 
"every day." It also receives MID stress when preceded by its own class, E — 
as in "book store," "market basket," "cotton mill." Otherwise the words in 
Group E receive HIGH stress. (Proper names, capitalized initially, are an 
exception to this rule. The rightmost word of a string of initially capitalized 
words receives HIGH stress by rule; those at the left are assigned MID stress.) 
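
The proper-name exception can be stated in a few lines of Python. This is a sketch only; the function name is invented, and it assumes the run of capitalized words has already been isolated, so that a capital at the start of a sentence is not mistaken for a name.

```python
def stress_capitalized_run(words):
    """Stress for a run of initially capitalized words: MID for every
    word except the rightmost, which receives HIGH."""
    if not words:
        return []
    if not all(w[:1].isupper() for w in words):
        raise ValueError("expected only initially capitalized words")
    return ["MID"] * (len(words) - 1) + ["HIGH"]
```

A two-word name such as "New Haven" thus comes out MID-HIGH, the reverse of the HIGH-MID that ordinary noun sequences like "book store" receive.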

[SYNTHETIC SPEECH DEMONSTRATION 4] (Five sentences demonstrating 
Groups D and E were played.) 



The last general stress category, Group C, contains all the words not
already classified. Group C thus contains a host of adjectives, adverbs, past- 
tense verb forms and past participles (many of which are also adjectives). 
This group can be viewed as intermediate on a scale of information content, 
generally less significant in a message than nouns and other semantically power- 
ful words. MID stress has proven appropriate for this group in all positions 
except pre-pausal, where a C word receives HIGH stress. Group C is thus 
unstable in stress. 
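
With all five groups defined, the context rules can be drawn together in one short function. The Python sketch below is a simplification under stated assumptions: the input is a single breath group whose last word stands before a pause, each word arrives already tagged with its group, and the sub-category exceptions (successive prepositions, proper names, hyphenated words) are ignored. The group tags in the example are our own guesses at how the dictionary would classify those words.

```python
def assign_stress(tagged, pause_after_last=True):
    """tagged: list of (word, group) pairs, group being 'A' to 'E'.
    Returns one stress grade (LOW, MID, or HIGH) per word."""
    grades = []
    for i, (word, group) in enumerate(tagged):
        pre_pause = pause_after_last and i == len(tagged) - 1
        prev_group = tagged[i - 1][1] if i > 0 else None
        if group == "A":          # unstable: LOW, but MID before a pause
            grade = "MID" if pre_pause else "LOW"
        elif group == "B":        # stable LOW
            grade = "LOW"
        elif group == "C":        # unstable: MID, but HIGH before a pause
            grade = "HIGH" if pre_pause else "MID"
        elif group == "D":        # stable HIGH
            grade = "HIGH"
        elif group == "E":        # unstable: MID after a D or E word
            grade = "MID" if prev_group in ("D", "E") else "HIGH"
        else:
            raise ValueError(f"unknown stress group: {group!r}")
        grades.append(grade)
    return grades

phrase = [("the", "A"), ("four", "D"), ("books", "E"),
          ("were", "A"), ("sold", "C")]
print(assign_stress(phrase))
```

Each decision needs only the word's own group, the group of the word to its left, and whether a pause follows, which is why no parsing of the sentence is required.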

[SYNTHETIC SPEECH DEMONSTRATION 5] (Group C words were demonstrated 
in three sentences. All of the synthetic speech in Demonstrations 1 
through 5 was played at rates within a 120-140 words per minute range.)



There are few words that do not fit fairly well, in practice, within one 
of these stress groups, but there are some words that require additional rules. 
Special rules are being developed as grouped exceptions are noted in the course
of producing texts for blind students. (Capitalized words have already been 
mentioned, and hyphenated words are another special case. A single hyphen, not 
line final, calls for a HIGH-MID stress sequence for the words it joins. Two
hyphens, joining three words, signal a HIGH-LOW-MID stress sequence.) 
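
The hyphen rule is mechanical enough to sketch directly. In the Python below the function name is invented, and anything beyond the two cases stated in the text (one hyphen, two hyphens) is rejected, as is a line-final hyphen, since no rule is given for them.

```python
def hyphenated_stress(token):
    """Pair each part of a hyphenated token with its stress grade."""
    parts = token.split("-")
    if any(part == "" for part in parts):
        raise ValueError("line-final or stray hyphens are not handled here")
    patterns = {2: ["HIGH", "MID"], 3: ["HIGH", "LOW", "MID"]}
    if len(parts) not in patterns:
        raise ValueError("only one or two hyphens are covered by the rule")
    return list(zip(parts, patterns[len(parts)]))
```

For instance, `hyphenated_stress("text-to-speech")` pairs "text" with HIGH, "to" with LOW, and "speech" with MID.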

[SYNTHETIC SPEECH DEMONSTRATION 6] (Examples of special cases were 
played.)



The criteria for establishing the membership of a word in one of the stress 
groups described are not elegant, it must be admitted. To recapitulate: 
function words fall within Group A; certain pronouns and contractions, for the 
most part, make up Group B; words of four or more syllables, comparative and 
superlative adjectives, numerals, and a list of special words comprise Group D; 
nouns not belonging to Group D fall into Group E; and all the remaining words
are gathered in Group C (with exceptions as noted earlier).

But if the categories are regrouped according to the stress grade initially
assigned them, some order appears. Groups D and E, assigned HIGH stress (nouns 
and the like) are high in information and low in predictability. Most of these 
words occur as subjects or objects in sentences and are customarily stressed in 
the spoken chain. In contrast, the highly frequent and low in information words
of Groups A and B that have been assigned LOW stress are the barely audible
connective tissue of spoken sentences. And Group C, with MID stress, containing
all the other words, consists chiefly of modifiers and verbs denoting completed
action (and these words may be considered intermediate in information
content).

The five stress groups are rules for stress assignment in running speech, 
and they work well in view of their relative simplicity. Parsing is not 
necessary. (The inherent stress grade assigned to each word in the stored 
dictionary represents a form of parsing.) Tagging the thousands of words in 
the dictionary by stress type, however, has presented interesting problems in 
programming. At present the stored dictionary contains excess information. 
(The dictionary was "inherited," so to speak, and came to Haskins with an
embarrassment of grammatical riches attached.) 

In conclusion, here is one more synthetic speech sample using the stress 
rules (word categories and stress shifts due to context) that have been described. 
The text to be heard is one that was requested by a blind student for a reading 
assignment in psychology at the University of Connecticut. The tape will be 
played at 120 words per minute, then at 150, and finally at 200 words per minute. 

[SYNTHETIC SPEECH DEMONSTRATION 7] (A 132-word text sample was 
played.) 

Auditory Evoked Potential Correlates of Speech Sound Discrimination 



Michael F. Dorman 

Haskins Laboratories, New Haven 



Numerous studies have indicated that the sounds of speech enjoy a special 
mode of perception, distinct from that of nonspeech signals (Liberman, 1970; 
Liberman, Cooper, Shankweiler, and Studdert-Kennedy, 1967). One set of inves- 
tigations supporting this view has examined the relationship between identifi- 
cation and discrimination of speech and nonspeech signals. Listeners can dis- 
criminate many more nonspeech stimuli than they can identify absolutely (Miller, 
1956; Pollack, 1952). However, certain speech sounds, the stop consonants 
[b,d,g,p,t,k], tend to be discriminated no better than they can be identified 
(Pisoni, 1971; Studdert-Kennedy, Liberman, Harris, and Cooper, 1970). This 
unique relationship between identification and discrimination is termed 
"categorical perception." 

In a typical experiment, Lisker and Abramson (1970) presented to Ss for 
identification and discrimination a series of computer-synthesized stop conso- 
nants which differed solely along the physical continuum of voice onset time 
(VOT).^1 Listeners identified these stimuli exclusively as members of the pho- 
netic category [ba] or [pa]. Ss discriminated almost perfectly between stimuli 
which were assigned to different phonetic categories. However, when physically 
different stimuli were drawn from the same phonetic category, discrimination 
was only slightly better than chance. Thus, equal acoustic differences (for 
example, 20 msec) along the VOT continuum were not equally discriminable. Only when 



This paper reports a portion of the research carried out for a Ph.D. disserta- 
tion accepted by the University of Connecticut in 1971. 

^Currently Herbert Lehman College of the City University of New York. 

^1 VOT refers to the relative timing of the release of supraglottal closure and 
the onset of laryngeal pulsation or "voicing." Abramson and Lisker (1970) have 
argued that the acoustic features of explosion energy, amount of aspiration, 
and first-formant intensity may all be derived from the single articulatory 
variable of VOT. In sound spectrograms VOT is reflected by the onset of the 
first formant relative to the second and third formants and, for stop conso- 
nants with a delayed onset of the first formant, the presence of aspiration in 
the upper formants in the period preceding the onset of voicing. 

earlobe. Resistance between electrodes was always less than 5K Ohms. 



The EEG signals were transmitted by telemetry (Narco FM-1100-E3) to an AC 
preamplifier (W-P Instruments DAM 6) and an oscilloscope amplifier (Tektronix 
RM 502A) which also served to monitor the EEG. The frequency response of the 
system after amplification was flat between 2.0 and 30 Hz. The amplified EEG 
was stored for later analysis using a Vetter FM Adapter (FM-3) and a Sony 355 
tape deck. 

The extraction of the evoked response from the EEG was carried out both on- 
and off-line by a computer of average transients (Fabri-Tek 1072). The sweep 
duration was one second. The averaging cycle of the computer was triggered by a 
pulse from the second channel of the stimulus presentation tape. The onsets of 
the cuing pulses and the synthetic speech stimuli were simultaneous. The AER 
records were written out on an X-Y plotter (Hewlett-Packard 7035b). 
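
The averaging step described above can be sketched in a few lines. This is only an illustration of trigger-locked signal averaging, not a description of the Fabri-Tek hardware: the function name, the sample-index representation of the cuing pulses, and the choice to divide by the number of sweeps (rather than simply sum them) are assumptions made for the example.

```python
from typing import List, Sequence


def average_evoked_response(eeg: Sequence[float],
                            trigger_indices: Sequence[int],
                            sweep_samples: int) -> List[float]:
    """Average fixed-length EEG sweeps, each starting at a trigger sample.

    eeg             -- digitized EEG samples (hypothetical representation)
    trigger_indices -- sample index of each cuing pulse (= stimulus onset)
    sweep_samples   -- number of samples in one sweep (1 sec in the text)
    """
    # Keep only sweeps that fit entirely inside the recording.
    sweeps = [eeg[t:t + sweep_samples] for t in trigger_indices
              if t + sweep_samples <= len(eeg)]
    if not sweeps:
        raise ValueError("no complete sweeps to average")
    n = len(sweeps)
    # Point-by-point mean across sweeps; activity not time-locked to the
    # stimulus tends to cancel, leaving the evoked response.
    return [sum(column) / n for column in zip(*sweeps)]
```

Because the later measures are ratios of amplitudes obtained from the same number of sweeps, summing and averaging would lead to the same ratio scores.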

Stimuli. The three synthetic, stop consonant-vowel syllables used in this 
study are shown in Figure 1. These stimuli were generated on the Haskins 
Laboratories computer-controlled parallel-resonance synthesizer (Cooper and 
Mattingly, 1969). 

The three stimuli differed solely along the VOT continuum: 0 msec VOT 
(0 VOT); 20 msec VOT (20 VOT); and 40 msec VOT (40 VOT). Stimulus duration was 
250 msec. For stimulus 0 VOT, the onsets of the first (F1), second (F2), and 
third (F3) formants were simultaneous; for stimulus 20 VOT, F1 began 20 msec 
after F2 and F3; for stimulus 40 VOT, F1 began 40 msec after F2 and F3. Aspir- 
ation was added to the upper formant frequencies during the period of F1 delay 
for stimuli 20 and 40 VOT. Thus, each adjacent pair of stimuli along the VOT 
continuum differed by exactly 20 msec VOT (i.e., 20-0 VOT and 20-40 VOT). 
Previous identification studies have indicated that stimuli with 0 and 20 VOT 
are identified as members of the phonetic category [ba] and that stimulus 40 VOT 
is identified as a member of the phonetic category [pa] (Lisker and Abramson, 
1970).^2 Discrimination tests have indicated that the pair 20-40 VOT is discrim- 
inated essentially perfectly. The pair 20-0 VOT is discriminated just slightly 
better than chance (Abramson and Lisker, 1970). In the following account, 
stimulus 20 VOT will be termed the "standard" stimulus, stimulus 0 VOT the "within- 
category" shift stimulus, and stimulus 40 VOT the "across-category" shift stimulus. 

Preparation of the stimulus tapes. With the aid of the computer-controlled 
synthesizer four stimulus sequences were recorded on audio tape. Two of the 
stimulus sequences were composed of varying length runs of standard stimuli 
(20 VOT), separated by pairs of either within- or across-category shift stimuli. 
There were a total of 154 standard stimuli and sixteen pairs of shift stimuli 
in each sequence. The pairs of shift stimuli occurred on the average once every 
ten successive standard stimuli (range 6-14). In one sequence the pairs of 



^2 The three synthetic speech stimuli used in this study were slight modifications 
of the stimuli used by Lisker and Abramson (1970). Informal listening tests by 
the author and his colleagues indicated that the 20 VOT stimulus used in the 
present study was labeled more consistently as a [ba] than the 20 VOT stimulus 
used by Lisker and Abramson. These tests also indicated that the 20 VOT stimulus 
was discriminated less often from the 0 VOT stimulus than was the corresponding 
stimulus used by Lisker and Abramson. 

shift stimuli were within-category stimuli; in the other, across-category stimuli. 
A third stimulus tape consisted of a single sequence of 186 standard stimuli. 

The fourth stimulus sequence contained an alternating sequence of blocks 
of ten within-category stimuli and ten across-category stimuli separated by 
30-sec interblock intervals. There were three blocks of each shift category. 
The interstimulus interval (onset to onset) for all sequences was 2 sec. 
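
As a rough illustration of how a shift sequence with these properties could be laid out, the sketch below builds one list of stimulus labels. The text gives only the totals (154 standards, sixteen shift pairs) and the run-length range (6-14); the function name, the random allocation of run lengths, the seed, and the assumption that every run of standards is followed by a shift pair are all hypothetical.

```python
import random
from typing import List


def make_shift_sequence(n_standard: int = 154, n_pairs: int = 16,
                        min_run: int = 6, max_run: int = 14,
                        shift_label: str = "0 VOT",
                        seed: int = 0) -> List[str]:
    """Return stimulus labels: runs of standards, each followed by a shift pair."""
    if not n_pairs * min_run <= n_standard <= n_pairs * max_run:
        raise ValueError("run-length limits cannot yield the requested total")
    rng = random.Random(seed)
    # Start every run at the minimum length, then hand out the remaining
    # standards one at a time to runs that are still below the maximum.
    runs = [min_run] * n_pairs
    for _ in range(n_standard - n_pairs * min_run):
        open_runs = [k for k, length in enumerate(runs) if length < max_run]
        runs[rng.choice(open_runs)] += 1
    sequence: List[str] = []
    for length in runs:
        sequence += ["20 VOT"] * length + [shift_label] * 2
    return sequence
```

A sequence built this way contains 154 + 32 = 186 stimuli, the same length as the all-standard tape; at a 2-sec onset-to-onset interval that is 372 sec, or a little over six minutes of stimulation.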

Design. The Ss were assigned to five groups (ten Ss per group). The groups 
were run successively. The experimental task for the Ss in Groups 1, 2, 3, and 4 
was to detect the occurrence of shift stimuli embedded in the sequence of standard 
stimuli. 

The Ss in Group 1 listened first to the within-category shift sequence 
(20-0 VOT), then, on the following day, to the across-category shift sequence 
(20-40 VOT). The Ss in Group 2 also listened to both sequences on successive 
days, but in the reverse order. 

Group 3 was given twenty practice trials with both the standard and within- 
category stimuli before listening to the within-category shift sequence. Pretrain- 
ing consisted of twenty presentations of a group of four stimuli; two standard 
stimuli followed by two within-category stimuli. The interval between the groups 
was 5 sec. The Ss were told the order of the different stimuli and were instructed 
to try to detect any difference between the sounds. The within-category shift 
sequence was begun immediately after pretraining. These Ss were given pretraining 
to determine whether increased familiarity with the "unfamiliar" nonphonemic 
distinction would improve performance. 

In a no-shift condition (Group 4) the Ss listened to the tape which con- 
tained all standard stimuli. The purpose of this control was to establish a 
baseline from which to assess the effects of the different shift conditions. In 
the other control condition (Group 5) the Ss listened to the randomized sequence 
of blocks of within- and across-category stimuli (the fourth stimulus sequence). 
The purpose of this control was to determine the amplitude of the AER to the 
across- and within-category stimuli in a setting unrelated to the discrimination 
tasks and thus to assess the "inherent" amplitude of the AERs to the 0 and 40 
VOT stimuli. 

Groups 3, 4, and 5 were tested in a single session. The session duration 
was approximately 7 minutes. 

Analysis of the evoked potentials. The amplitude difference between the 
N1 and P2 responses was determined from the X-Y plots by measuring the difference 
in millimeters between the maximum wave of negativity between 75 and 125 msec 
after stimulus onset (N1) and the maximum wave of positivity between 175 and 
225 msec (P2). 
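
The measurement was made by hand from X-Y plots; the following sketch shows the same peak-to-peak rule applied to a digitized waveform. It assumes, for illustration only, that sample 0 coincides with stimulus onset, that positive values represent positivity, and that the sampling rate is known; the function and parameter names are hypothetical.

```python
from typing import Sequence, Tuple


def n1_p2_amplitude(waveform: Sequence[float], sample_rate_hz: float,
                    n1_window_ms: Tuple[float, float] = (75.0, 125.0),
                    p2_window_ms: Tuple[float, float] = (175.0, 225.0)) -> float:
    """Peak-to-peak N1-P2 amplitude of an averaged evoked response."""

    def window(lo_ms: float, hi_ms: float) -> Sequence[float]:
        # Convert the latency window to sample indices (inclusive bounds).
        lo = int(round(lo_ms * sample_rate_hz / 1000.0))
        hi = int(round(hi_ms * sample_rate_hz / 1000.0))
        return waveform[lo:hi + 1]

    n1 = min(window(*n1_window_ms))  # most negative point in the N1 window
    p2 = max(window(*p2_window_ms))  # most positive point in the P2 window
    return p2 - n1
```

The result is in whatever units the waveform uses (millimeters of plotter deflection in the present study).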

Each AER was the sum of sixteen individual responses. Responses to the 
standard and shift stimuli were averaged separately in all conditions. A 
separate AER was accumulated for each member of the shift pairs. The AER to the 
last standard stimulus before the shift pair was designated as the AER to the 
standard stimulus. In the no-shift condition (Group 4) evoked responses were 
accumulated for the stimuli which occurred in the same positions as the standard 
and shift stimuli in the shift conditions. For the stimulus control condition 
(Group 5) separate evoked responses were accumulated for the within- and across- 
category stimuli by summing over blocks of trials. 

Procedure. All Ss were instructed to remain as motionless as possible 
during the experiment and to fixate on a point in front of them. The Ss in 
Groups 1, 2, 3, and 4 were instructed to "listen for" the occurrence of "any 
change" from the standard stimuli. The Ss were not told which pair of shift 
stimuli would occur in a given test sequence. The Ss in Group 3, after practice 
with the within-category and standard stimuli, were told to "listen for" the 
same changes in the test sequence as they had listened to in the practice 
sessions. The Ss in Group 5 were told that they would hear separate blocks 
of [pa] and [ba] and were instructed simply to listen to the stimuli. 



RESULTS 



Amplitude of N1-P2. For each S, the amplitude scores for both shift 
stimuli were expressed as the ratio of the shift stimulus amplitude to the 
standard stimulus amplitude. A ratio score of 1.0 indicated that the amplitudes 
of the standard and shift stimuli were identical. A ratio score greater than 
1.0 indicated a larger shift stimulus amplitude than standard stimulus amplitude. 
For the Ss in Group 1 (across shift then within shift) and Group 2 (within shift 
then across shift) separate ratio scores were computed for the within- and 
across-category shift conditions. The ratio scores for Groups 1-4 collapsed 
across Ss are shown in Table 1. 
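
The ratio score defined above is simple enough to state as code. The sketch below is illustrative only; the function name and the amplitude values in the usage note are hypothetical and are not data from the study.

```python
def ratio_score(shift_amplitude: float, standard_amplitude: float) -> float:
    """Ratio of the shift-stimulus N1-P2 amplitude to the standard's.

    1.0 means identical amplitudes; values above 1.0 mean the shift
    stimulus evoked the larger response.
    """
    if standard_amplitude <= 0:
        raise ValueError("standard amplitude must be positive")
    return shift_amplitude / standard_amplitude
```

For example, a shift-stimulus amplitude of 30 mm against a standard amplitude of 20 mm gives a ratio score of 1.5.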



TABLE 1 

Average Ratio of the Shift Stimulus N1-P2 Amplitude to the N1-P2 
Amplitude of the Standard Stimulus 

                             Position in Shift Pair 
Shift Category               1st         2nd 

Group 1 
   Across                    1.36        1.60 
   Within                    0.92        0.8? 

Group 2 
   Within                    0.95        0.90 
   Across                    1.35        1.51 

Group 3 
   Pretrained Within         0.92        0.90 

Group 4 
   No Shift                  0.95        0.90 



For Groups 1 and 2, the effects of presentation order (within shift then 
across shift vs. across shift then within shift), shift type (within vs. across), 
and location in the shift pair (first vs. second) were compared in an analysis 
of variance. A reliable main effect due to shift type was obtained 
[F = 25.00, p < .01; X̄ (within shift) = .91, X̄ (between shift) = 1.45]. No 
other main effects were significant. A shift type x location interaction was 
also obtained (F2,18 = ?.66, p < .05). 

The difference in N1-P2 amplitude to the within- and across-category shifts 
is illustrated for a representative S in Figure 2. In the across-category shift, 
the amplitude of both members of the shift pair (BS1 and BS2) exceeded that of 
the standard stimulus (S). For the within-category shift, neither member of the 
shift pair was larger than the standard stimulus. 

Since the analysis of variance showed no significant effect due to presen- 
tation order, the data for the within- and across-category shifts were pooled 
over Groups 1 and 2. Two additional analyses of variance were then computed 
with the pooled data. 

The first analysis compared the pooled across-category shift condition from 
Groups 1 and 2 with Group 3 (pretrained within-category shift) and Group 4 
(no shift). In the groups x location analysis of variance only the groups effect 
was significant (F2,37 = 13.16, p < .01). Post hoc comparisons according to 
Scheffé revealed that the pooled across-category shift condition (X̄ = 1.46) 
differed from both the pretrained within-category condition (X̄ = .91) and the no- 
shift condition (X̄ = .92) at the .05 level. A second analysis of variance com- 
pared the pooled within-category shift condition from Groups 1 and 2 with 
Groups 3 and 4. The analysis of variance showed no reliable effects. 

For Group 5, the absolute (N1-P2) amplitude difference of the AERs to the 
within-category stimulus (0 VOT) and to the across-category stimulus (40 VOT) 
were compared by a correlated t-test. The amplitudes of the two stimuli were 
not significantly different (t9 = 1.01, n.s.). 

DISCUSSION 

The comparison of the within- and across-category shift conditions demon- 
strated that the across-category shift (20-40 VOT) elicited a larger N1-P2 
response than the within-category shift (20-0 VOT). The difference in N1-P2 
amplitude in the two shift conditions cannot be attributed to an "inherently" 
larger N1-P2 response to the across-category stimulus (40 VOT) than to the within- 
category stimulus (0 VOT), since in the stimulus control condition (Group 5) the 
amplitude of the N1-P2 response to 0 VOT [ba] and to 40 VOT [pa] did not differ. 
This outcome suggests that the difference in N1-P2 amplitude in the within- and 
across-category shift conditions was due to the difference in discriminability 
of the two types of shift. 

The comparison involving the within-category shift group and the no-shift 
control (Group 4) revealed that the N1-P2 response in the two conditions did 
not differ. Furthermore, pretraining with the within-category and standard 
stimuli (Group 3) did not alter the amplitude of the N1-P2 response in the within- 
category shift situation. 

Thus, the behavior of the N1-P2 component of the AER, under the conditions 
of the present study, mirrored the relative discriminability of the stop conso- 
nant pairs. 

Auditory to phonetic recoding. The "categorical" response of the N1-P2 
component of the AER suggests that within 100-200 msec after the onset of a 
stop consonant, the finely detailed acoustic stimulus has been recoded into a 
categorized phonetic representation. The data from the present study do not 
support the suggestion that a categorical response is generated at a "long" 
interval after stimulus onset as a function of an arbitrary labeling of two 
discriminably different stimuli as belonging to the same phonetic category. 

This interpretation of the data bears directly on the nature of the pro- 
cessing of the highly encoded stop consonants. After a stop consonant has been 
recoded into a categorized phonetic representation, a listener knows very little 
about the detailed acoustic structure of the auditory signal (e.g., VOT). The 
processing mechanism for the stop consonants appears to act like a "digitizing" 
device, accepting as input a highly variable and finely detailed auditory signal 
and then rapidly recoding it into a quantized phonetic representation (Mattingly 
et al., 1971). After recoding, the detailed auditory information does not seem 
to be stored in any accessible form. 

This interpretation of the data is supported by two recent studies exploring 
differences in the processing of stop consonants and steady-state vowels. 
Crowder (1971) using a serial recall task found that if the vowel portions of 
CV syllables were varied in a serial list, then a large recency effect was obtained 
during recall. If, however, the consonant portions of the syllables were varied 
in the lists, then no recency effect was obtained. If the recency effect is 
contingent upon an "echoic" or "precategorical" acoustic memory store of 2-3 sec 
duration, as Crowder and Morton (1969) have suggested, then the representation 
of a stop consonant does not persist 2-3 sec in "precategorical" auditory memory. 

The life span of auditory memory for stop consonants has also been studied 
using recognition memory tasks. In one of a series of studies, Pisoni (1971) 
varied the interval (0, .25, .50, 1.0, 2.0 sec) between vowel pairs and stop 
consonant-vowel pairs in an A-X discrimination paradigm. The discrimination of 
vowel stimuli was markedly affected by the A-X interval, with longer intervals 
producing poorer discrimination. Stop consonant discrimination, however, was 
relatively unaffected by A-X interval. Pisoni concluded that "information other 
than a binding phonetic categorization is unavailable for use in discrimination 
[of stop consonants]." The results of the present study are in complete agree- 
ment with those of Pisoni (1971) and Crowder (1971) and further reinforce the 
notion of a special mode of processing for the stop consonants characterized by 
the absence of a persistent noncategorical auditory image. 

Short-Term Habituation of the Infant Auditory Evoked Response* 
Michael F. Dorman^ and Robert Hoffmann^ 



In adults, the amplitude of the auditory evoked response (AER) at the 
vertex decreases as a negative exponential function of the number of stimulus 
presentations; it decreases faster, the faster the stimulus presentation rate, 
and recovers spontaneously when stimuli are withheld (Fruhstorfer, Soveri, and 
Jarvilehto, 1970). The decrease in AER amplitude reaches asymptote by the 
third to fifth presentation of a stimulus in a train (Ritter, Vaughn, and 
Costa, 1968; Fruhstorfer et al., 1970). Fruhstorfer (1971) has argued that 
the observed short-term reduction in AER amplitude over the first three to five 
presentations of a stimulus in a train is an instance of habituation (Thompson 
and Spencer, 1966). 

In infants, habituation to stimuli in the auditory modality has been 
difficult to demonstrate (Jeffrey and Cohen, in press). The present study used 
a short-term habituation paradigm similar to that of Fruhstorfer et al. (1970) 
to investigate the effects of repeated stimulus presentation on the amplitude 
of the infant vertex AER. At the same time, the study served to establish an 
efficient methodology for collecting reliable AERs from awake infants. 

METHOD 

Subjects. A total of nine infants completed all of the conditions of the 
study. Artifact-free AERs were obtained from six (five male, one female) of 
the infants. All of these Ss were between 10 and 14 weeks old. 

Apparatus. Recording of the electroencephalogram (EEG) was made from the 
scalp using a single silver-disk electrode located at the vertex (Jasper, 1958) 
which was referenced to the right ear lobe. Electrodes were attached to the 
scalp by styrofoam adhesive pads and an elastic headband. Electrode impedance 
was less than 6K Ohms. 

The EEG signals were transmitted by telemetry (Narco FM-1100-E3) to an AC 
preamplifier (W-P Instruments DAM 6) and an oscilloscope amplifier (Tektronix 
RM 502A) which also served as a monitor. The frequency response curve after 
amplification was flat between 2.0 Hz and 30 Hz. The amplified EEG was stored 
on tape for later analysis using a Vetter FM-3 Recording Adapter and Sony 355 
tape deck. 

The extraction of the evoked response from the EEG was carried out on- and 
off-line by a computer of average transients (Fabri-Tek 1072). The sweep dura- 
tion was 1 sec. The averaging cycle of the computer was triggered by a pulse 
from the second channel of the stimulus presentation tape. The onsets of the 
cuing pulses and the synthetic speech sounds were simultaneous. The AER records 
were written out on an X-Y plotter (Hewlett-Packard 7035b). 

Stimuli. The stimuli used in this study were trains of the stop consonant- 
vowel syllable [ba]. The duration of the syllable was 250 msec; the rise time, 
25 msec; the intensity, 65 dB SPL. The stimuli were generated on the Haskins 
Laboratories computer-controlled speech synthesizer (Mattingly, 1968). 

Design and procedure. During a session, fifteen trains of four stimuli 
were presented at a rate of 1 train/30 sec from an AR 4-X loudspeaker placed two 
feet in front of the S.^1 The repetition rate of the stimuli was 1 stimulus/2 sec. 

The Ss were held in their mother's lap and were either bottle or breast fed 
during the test session. The mothers were instructed to hold the infants as 
quietly as possible and not to move the infant's bottle during presentation of 
the stimuli. 

Analysis of the AERs. The amplitude of the N1-P2 response was determined 
from the X-Y plots by measuring the difference in millimeters between the maximum 
peak of negativity between 75 and 150 msec after stimulus onset (N1) and the 
maximum peak of positivity between 175 and 275 msec (P2). The responses to each 
member of the stimulus train were averaged separately. Ten good responses (i.e., 
those with no movement artifacts) were accumulated for each average. 

RESULTS 

The amplitude of the N1-P2 response as a function of the position of the 
stimulus in the train is shown in Figure 1. The amplitudes of the second, third, 
and fourth stimuli in the train are expressed as a percentage of the first 
stimulus amplitude. The mean amplitudes of the second, third, and fourth stim- 
uli in the train were 36.0%, 41.0%, and 21.7% of the first stimulus amplitude. 
All amplitude reductions were significantly different from the first stimulus 
amplitude (p < 0.01) using a rank sum test. 
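
The normalization used for Figure 1 can be sketched as follows. The function name and the amplitudes in the usage note are hypothetical, and the sketch does not settle whether percentages were computed for each S before averaging or from group means, which the text does not state.

```python
from typing import List, Sequence


def percent_of_first(amplitudes: Sequence[float]) -> List[float]:
    """Express later N1-P2 amplitudes in a train as a percentage of the first."""
    first = amplitudes[0]
    if first <= 0:
        raise ValueError("first-stimulus amplitude must be positive")
    return [100.0 * a / first for a in amplitudes[1:]]
```

With illustrative amplitudes of 40, 20, 10, and 10 mm for the four train positions, the function returns 50, 25, and 25 percent for the second, third, and fourth stimuli.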



^1 The stimuli were never presented when an infant was active or fussing. Thus, 
on a number of trials for all Ss, an intertrain interval of greater than 30 sec 
was used. 

Little difficulty was encountered in collecting artifact-free AERs from 
the awake infants. As long as the infants were brought into the laboratory 
hungry and were fed during the recording session, artifacts due to infant move- 
ment were nil. The use of FM telemetry rather than long cables to convey the 
EEG data to the recording apparatus also helped minimize movement artifacts. 

The N1-P2 amplitude of the vertex AER in the awake infants decreased 
rapidly as a function of the repeated presentation of the syllable [ba] in a 
stimulus train. The magnitude and time course of the decrease in N1-P2 ampli- 
tude of the infant AER is consistent with the findings of both Ritter et al. 
(1968) and Fruhstorfer et al. (1970) on the short-term habituation of the adult 
AER. However, because of the differences in the interstimulus and intertrain 
intervals between the present study and the previously cited studies with adults, 
the rates of habituation of the infant and adult AERs cannot be directly 
compared. 

When the stimulus train was withheld during the 30-sec intertrain interval, 
the N1-P2 amplitude recovered spontaneously. This was evidenced in the absolute 
N1-P2 amplitude to the first and fourth members of the stimulus train. Thus, 
the decrease in the amplitude of the N1-P2 components of the infant AER in 
response to the repeated presentation of the syllable [ba] satisfies two of the 
characteristics of short-term AER habituation (Fruhstorfer, 1971). 

In adults, a habituated AER to the syllable [ba] can be at least partially 
dishabituated by the presentation of a novel syllable [pa] (Dorman, in prepara- 
tion). The results of the present study suggest that the AER could serve as a 
useful dependent variable in studying the perceptual abilities of awake infants. 

Early Apical Stop Production: A Voice Onset Time Analysis 
Diane Kewley Port and Malcolm S. Preston 



ABSTRACT 

Voice onset time (VOT) has been shown to effectively differen- 
tiate the phonemic categories of stop consonants along the voicing 
dimension. This study applied the measurement of VOT to the produc- 
tion of apical stops produced by young children acquiring American 
English. Stops were measured from three children who were recorded 
regularly between 1 and 2 years of age and from additional children 
ranging in age from 6 months to 4-1/2 years. Distributions of the 
percentage of occurrence of apical stops along the VOT continuum are 
compared longitudinally across subjects as well as with distributions 
of adult productions of word-initial /d/ and /t/. Drawing on a phys- 
iological discussion of the control of timing between the stop 
release and the onset of vocal fold oscillation, the following pat- 
tern of apical stop development is proposed. The earliest instances 
of stop articulation, around 6 months of age, have uniform distribu- 
tions along the VOT continuum. At a later stage the distribution of 
apical stops collapses into an interval corresponding to that of the 
adult production of /d/. With further development some apical stops 
are added in the range of adult /t/. The distributions of /d/ and 
/t/ words for children do not change from 2 to 4-1/2 years, but they 
do not yet correspond with those of adults. 

INTRODUCTION 



This investigation applies acoustic measurement techniques to a develop- 
mental study of the production of stop consonants. The measure selected for 
this project is voice onset time (VOT), which is defined as the time interval 
between the release of stop occlusion and the onset of vocal fold oscillation. 
VOT can be easily measured from spectrograms of adult consonant-initial vocal- 
izations. VOT measurements roughly comparable to those of adults can be made 
from spectrograms of the vocalizations of young infants if criteria for the 
selection of stop consonants are carefully applied. 

Using VOT measurements, the present study investigates the development of 
stop consonants for three children from 1 to 2 years of age. This report is 
limited to apical stops because the children in our sample produced apicals 
almost exclusively during the time period studied. These longitudinal data are 
supplemented with other data which include a brief study of words for children 
from 2 to 4-1/2 years of age. 

Linguists have claimed that voicing is a primary phonetic dimension for 
distinguishing among categories of stops produced at the same point of articu- 
lation. The voicing dimension for stops has been related to many different 
acoustic and articulatory phenomena. Lisker and Abramson (1971:770) have stated 
that voice onset time is "the single most effective measure" for sorting stops 
into different phonemic categories with respect to voicing, either productively 
or perceptually. Their own studies have repeatedly given support to this claim 
for production (Lisker and Abramson, 1964, 1967, 1970) and perception (Abramson 
and Lisker, 1965, 1970) across different languages. The measure of VOT is, how- 
ever, the manifestation of a complex interaction between supralaryngeal and 
laryngeal musculature used to produce stops. This paper will consider in detail 
the physiology of the production of stop consonants and its relationship to VOT 
measurements. Evaluating the data in the context of these discussions, a sequen- 
tial pattern of the development of apical stops with respect to VOT is proposed 
covering birth to 4-1/2 years of age. 

PROCEDURE 

The primary data of this study consisted of three sets of tape-recorded 
sessions, each set corresponding to one of three normally developing children 
(E3, E4, and E7) from American English-speaking environments. Tape recordings 
of E3 were analyzed at 45, 51, 60, 73, 81, 97, and 101 weeks of age. For E4 the 
ages of analysis were 50, 64, 82, 96, 111, and 125 weeks. For E7 sessions were 
analyzed at 34, 40, 51, 64, 75, 83, and 96 weeks. These ages were chosen to 
correspond across subjects at roughly 12-week (3-month) intervals. Recordings 
having the greatest amount of vocalization were chosen when more than one record- 
ing was available for a time period. 

E3 was a male, while E4 and E7 were females. E3 and E4 were the children 
of medical residents at the Johns Hopkins Hospital while E7 was the child of a 
senior undergraduate at the Johns Hopkins University. Thus, all three came from 
educated, middle-class families. Except for occasional colds, the three infants 
were in good health over the period during which the recordings were made. 

The tape-recording sessions were conducted in a sound-isolated booth (TAC 
model 1203) with the mother or father and occasionally an experimenter present. 
The instructions to the parents were simply to encourage the child to vocalize 
as much as possible. Quiet toys and objects of interest were present during 
the recording sessions, which generally lasted about 30 minutes each. The 
children's vocalizations were recorded at 7-1/2 ips on an Ampex tape recorder 
(model AG350). A condenser microphone and cathode follower (Bruel and Kjaer 
models 4131 and 4133) were connected by cable to the tape recorder outside the 
booth. 

The procedure for analysis involved a transcription of the entire session^ 
using a modified version of the Peterson-Shoup articulatory phonetic theory 
(Peterson and Shoup, 1966). Although phonetic transcriptions of infant vocaliz- 
ations are obviously necessary to identify the stop consonants appropriate for 
measurement, the referent of any symbol in that transcription is unclear. 
Phonetic theories, such as that of Peterson and Shoup (1966), have been developed 
for the purpose of describing the phone types of the linguistic vocalizations of 
adults and are based on substantial knowledge of adult acoustics and articulation 
and of the correspondence between the two. However, far less is known about the 
articulatory or acoustic properties of the vocalizations of infants, nor is any- 
thing known about the reality of the articulatory mechanisms implied by adults 
ascribing phone types to the vocal sounds of such young children. Hence, at best 
our phonetic transcriptions must be considered to be a set of adult phone types 
which seemed most similar to the vocalizations produced by our infant subjects. 
It is our belief, however, that our phonetic transcriptions are adequate for the 
purpose of reliably identifying initial stop consonants produced by young 
children. 

Another problem encountered was to select from the children's recordings 
a set of vocalizations which would be at least roughly comparable to words with 
initial stop consonants, as spoken by adults. In order to do this, rigorous 
selection criteria were developed based on the articulatory parameter values of 
the Peterson-Shoup theory. A vocalization was considered for analysis as long 
as its initial portion consisted of a stop consonant and a vowel. The trans- 
criber then carefully judged each one as follows. 

For the stop, the primary parameters required were plosive, alveolar 
(apical), and stop. The secondary parameters specified were: pulmonic air 
mechanism, egressive air direction, nonfrictional airflow, oral airpath (non- 
nasal), nonlateral lingual air path, open pharynx shape, natural tongue body 
shape (nonpalatalized and nonvelarized), and nonretroflexed tongue apex. Because 
infants exhibit a notable lack of control with reference to several secondary 
parameters, flexible criteria were used. Air pressure, whether lenis, normal, or 
fortis, was not judged except where it might have contributed to an excessively 
frictional airflow. The type of release, as relating to aspirated, unaspirated, 
or phonoaspirated stops, and lip shape were not judged. Laryngeal action was 
judged only for the vowel. 

For the vowel, any horizontal and vertical place of articulation with 
pulmonic air mechanism, egressive air direction, and nonfrictional airflow was 
accepted. Air pressure, general air path, lingual air path, pharynx shape, 
tongue shape, apex shape, and lip shape were not judged. Vocal fold oscillation 
presented a special problem for infants. It would have been impossible to 
choose as normal any particular kind of oscillation, since vocal fry and falsetto 
voice were frequently produced by all three infants. Thus the laryngeal actions 
accepted included breathy, voiced, laryngealized, pulsated, and phonoconstricted; 
however, voiceless, whispered, constricted, and stopped laryngeal actions were 
not accepted. Further consideration was not given to portions of a vocalization 
following the stop and vowel. 

Footnote: Two sessions, which had unusually large numbers of infant 
vocalizations, were only partially transcribed. 

Following transcription, wide-band spectrograms were made of the selected 
vocalizations on a sound spectrograph (Voiceprint model 4691A). To facilitate 
the measurement process, the vocalizations were always analyzed at half speed. 
Using the spectrograms, stops were categorized as initial in a small number of 
ambiguous cases by assuring that the stop was preceded by a pause of at least 
50 msec. Stops were discarded where the onset of voicing or release was 
difficult to identify on the spectrogram. 

Measurements of VOT to the nearest 10 msec were taken directly from the 
spectrograms. VOT is measured as the interval between the first vertical 
striation representing glottal pulsation and the onset of energy ("burst") 
representing the release of stop occlusion. When the glottal pulses precede the 
stop release (voicing lead), the VOT value is given a negative sign; when the 
stop release precedes the glottal pulses (voicing lag), the VOT value is positive. 
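
The sign convention can be made concrete with a short sketch. The function 
name, the argument names, and the example times are illustrative assumptions; 
only the sign rule and the 10-msec resolution are taken from the text. 

```python
def voice_onset_time(release_ms: float, voicing_onset_ms: float) -> int:
    """Return VOT in msec, rounded to the nearest 10 msec.

    Negative: glottal pulsing precedes the release (voicing lead).
    Positive: the release precedes glottal pulsing (voicing lag).
    Exact ties follow Python's round-half-to-even rule.
    """
    raw = voicing_onset_ms - release_ms
    return int(round(raw / 10.0)) * 10


# Hypothetical times read off a spectrogram (msec from its left edge).
lag_example = voice_onset_time(release_ms=120.0, voicing_onset_ms=133.0)
lead_example = voice_onset_time(release_ms=120.0, voicing_onset_ms=48.0)
```

In this sketch the first call describes a voicing lag stop and the second a 
voicing lead stop, so the two results differ in sign. 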

A second experimenter checked the VOT measurements and further eliminated 
any items which in his opinion did not meet the above criteria. Thus, only 
sounds which were clearly identified as apical stops in initial position and 
which could be measured for VOT were included in the final analysis. The number 
of apical stops per session included in the final analysis varies from thirteen 
to ninety-eight. However, only three of the total twenty sessions had fewer 
than twenty tokens. 

RESULTS 

Figure 1 presents the combined data for each of the three subjects in the 
form of frequency distributions covering the entire period investigated. 
Comparison distributions for adults borrowed from the work of Lisker and Abramson 
(1967:13) are also presented in Figure 1. Their two distributions are derived 
from sentences, some of which contained words starting with the phonemes /d/ or 
/t/, spoken by ten American English speakers. The data for each child are 
presented as a single distribution since there was no way to assign phonemic 
units to their babbling. Each of the distributions produced by the children 
should be compared separately with the distributions of /d/ and /t/ for the 
adults as well as with each other. It is evident that the children's 
distributions are remarkably similar to one another. Each has a single mode, and 
the majority of the productions fall in the 0 to +20 msec voicing lag region. 

Footnote: The adult distribution presented in the figures of this paper 
combines the VOT values for both stressed and unstressed words and represents 
what we could consider to be the model of adult /d/ and /t/ distributions 
presented to the child in normal speech. On the other hand, distributions for 
words in only the stressed position should correspond more closely to the 
isolated stop-initial utterances collected from the children. The differences 
between the two types of distributions are small: in particular, for stressed 
words there is a better separation between the VOT values for /d/ and /t/, and 
the mode for /t/ is greater, +50 msec vs. +40 msec. 

To facilitate comparison of the children's data with those of adults, we 
introduce some terminology from the studies of Lisker and Abramson (1964, 1970, 
1971). In their cross-language studies of initial stop consonants, three 
categories of stops having a rough correspondence across languages emerge along 
the voice onset time continuum. The categories are defined as follows: voicing 
lead, where stops have negative VOT values; short voicing lag, where stops have 
VOT values from 0 to +20 msec; long voicing lag, where stops have VOT values 
greater than +40 msec. As Figure 1 shows, measurements of American English 
apical stops produce two partially overlapping frequency distributions with a 
boundary between +20 and +30 msec. The majority of VOT values for /d/ lie in 
the short voicing lag category, although a small percentage occur in the voicing 
lead category. With respect to American English, it will sometimes be 
convenient to use the term "d-range" to refer to VOT measurements of +20 msec and 
less. Similarly, the term "t-range" will refer to VOT values of +30 msec and 
greater, noting that most values for /t/ lie in the long voicing lag category. 
The d-range and t-range, as defined, reflect a basic attribute of the voice onset 
time models which the child will eventually acquire for distinguishing words 
beginning with /d/ and /t/, namely that values along the voice onset time 
continuum are divided into two reasonably distinct classes. 
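
As a minimal sketch of the d-range and t-range definitions, the following 
assigns a measured VOT value to one of the two ranges. The function and 
constant names are hypothetical; the two thresholds are the ones stated above, 
and the sketch assumes values measured in 10-msec steps. 

```python
D_RANGE_MAX_MS = 20  # "d-range": +20 msec and less, including voicing lead
T_RANGE_MIN_MS = 30  # "t-range": +30 msec and greater


def vot_range(vot_ms: int) -> str:
    """Assign a VOT value (measured in 10-msec steps) to the d- or t-range."""
    if vot_ms <= D_RANGE_MAX_MS:
        return "d-range"
    if vot_ms >= T_RANGE_MIN_MS:
        return "t-range"
    # Only reachable for values strictly between +20 and +30 msec, which
    # cannot occur when measurements are taken to the nearest 10 msec.
    raise ValueError(f"VOT {vot_ms} msec falls between the two ranges")


# Hypothetical measurements in msec.
labels = [vot_range(v) for v in (-90, 0, 20, 30, 160)]
```

Raising an error for in-between values, rather than picking a side, keeps the 
+20/+30 msec boundary explicit if finer-grained measurements are ever supplied. 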

A comparison of the children's data with the adult phonemic data suggests 
that the children reflect the English use of both /d/ and /t/. Only about 5% of 
the apical stops have voicing lead, whereas 64% are in the short voicing lag 
category and 31% in the long voicing lag category. Thus, during the period 
covered for each child, there are productions falling in both the d-range and 
t-range of VOT with a distinct preference for the d-range at approximately a 
two-to-one ratio. The children's distributions are unimodal in contrast to the 
adult data which, if combined into a single distribution, would show two modes, 
one for each category of apical stop. 
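
The two-to-one figure can be checked from the percentages just given. The 
sketch below assumes, for illustration only, that voicing lead and short 
voicing lag together make up the d-range and that the long voicing lag 
percentage corresponds to the t-range; the variable names are hypothetical. 

```python
# Percentages of apical stops reported for the three children combined.
lead_pct, short_lag_pct, long_lag_pct = 5, 64, 31

# Assumed grouping: lead + short lag -> d-range, long lag -> t-range.
d_range_pct = lead_pct + short_lag_pct
t_range_pct = long_lag_pct

ratio = d_range_pct / t_range_pct  # d-range productions per t-range production
```

Under that grouping the ratio comes out slightly above two, consistent with 
the "approximately two-to-one" description. 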

Figures 2, 3, and 4 present the data arranged in longitudinal fashion for 
E3, E4, and E7, respectively. Each distribution in these three figures 
corresponds to a recording session at a single age going from youngest at the 
bottom to oldest near the top. The Lisker and Abramson data for adults are again 
reproduced at the top of each figure. 

Inspection of the data for E3 at 45 and 51 weeks shows a concentration of 
apical stops in the short voicing lag category with only a few tokens in the 
long voicing lag category. By 101 weeks, E3 shows a considerable number of long 
voicing lag stops ranging from +30 to +160 msec, with no preference for any 
particular value. There are almost no stops in the voicing lead range at any 
age. The mode of all the distributions remains at +10 msec VOT with the 
exception of 97 weeks where it lies at +20 msec VOT. 
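
The mode of a session's distribution can be found by tallying the 10-msec VOT 
values. The following is a sketch only: the function name and the session data 
are invented for illustration and are not taken from the recordings. 

```python
from collections import Counter


def vot_modes(vots_ms):
    """Return the most frequent VOT value(s) in a non-empty list of values."""
    counts = Counter(vots_ms)
    top = max(counts.values())
    # More than one value is returned when several share the top count.
    return sorted(v for v, n in counts.items() if n == top)


# Hypothetical session: VOT values in msec, already rounded to 10 msec.
session = [0, 10, 10, 10, 20, 20, 30, 60, 110]
modes = vot_modes(session)
```

Returning a list rather than a single number makes a distribution with no 
single peak visible as such. 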

For E4 the developmental pattern is much like that of E3. At 50 weeks of 
age, there is a concentration of short voicing lag stops, although stops do occur 
in the other categories. By 96 weeks, E4 produces a considerable number of long 
voicing lag stops and continues to do so at 111 and 125 weeks. Again few stops 
with voicing lead occur. Distributions have a single mode that ranges from 0 to 
+20 msec. 

The developmental sequence for E7 contrasts in certain respects with that 
of E3 and E4. First, E7 was recorded at a much earlier age than E3 or E4. The 
distributions for the earliest recording sessions (34 and 40 weeks), for which 
no comparable data exist for E3 or E4, have a wide range of VOT values with no 
apparent mode. At 51 weeks, unlike E3 and E4, there is still a wide range of 
VOT values, -90 msec lead to +280 msec lag, although there is now a mode at +20 
msec lag. Thereafter up to 96 weeks of age, the mode remains at +10 or +20 msec 
voicing lag. A concentration of stops in the short voicing lag category does 
occur at 75 weeks, and then at 83 and 96 weeks long voicing lag stops again 
appear more frequently. E7 has more stops in the voicing lead range than E3 or 
E4 up to 51 weeks; after 64 weeks such stops also occur infrequently. 

These data can be collapsed by categorizing stops into the d-range or t-range 
as previously defined. Thus a graphic representation of stops in the t-range as 
a percentage of the total number of stops at each age characterizes the 
developmental sequence in which stops representative of the adult models of /d/ 
and /t/ are observed. 
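
The collapsed measure can be sketched as follows. The function name and the 
session data are invented for illustration; only the +30 msec t-range boundary 
and the idea of one percentage per recording age come from the text. 

```python
def percent_t_range(vots_ms, t_range_min_ms=30):
    """Percentage of stops with VOT at or above the t-range boundary."""
    if not vots_ms:
        raise ValueError("no stops measured in this session")
    in_t_range = sum(1 for v in vots_ms if v >= t_range_min_ms)
    return 100.0 * in_t_range / len(vots_ms)


# Hypothetical sessions keyed by age in weeks; values are VOT in msec.
sessions = {
    50: [0, 10, 10, 10, 20, 20, 40],
    96: [0, 10, 20, 30, 50, 80, 120, 160],
}

# One point per recording age, as would be plotted in a developmental curve.
curve = {age: round(percent_t_range(v), 1) for age, v in sessions.items()}
```

Expressing each session as a percentage, rather than a raw count, makes 
sessions with very different numbers of tokens comparable. 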

Graphs of this type for E3, E4, and E7 are presented in Figures 5, 6, and 7. 
Subjects E3 and E4 have a similar developmental pattern from the early sessions 
(one year) to two years (102 weeks). At about one year, only 15% of stops pro- 
duced are in the adult t-range. This percentage gradually increases until by 
two years the percentage is over 50%. The drop in the percentage by E4 for 111 
and 125 weeks was the result of a distinct change in vocal behavior. Before two 
years of age, vocalizations during the half-hour recording sessions were par- 
tially babbling and partially recognizable speech with the attention of the child 
constantly changing. In later sessions, however, almost all vocalizations were 
recognizable speech and E4's attention was centered through almost the entire 
session on a single play activity which happened to involve "dishes." Thus the 
percentage of stops in the t-range for the older sessions is representative of 
data that is qualitatively different from that of the younger sessions. 

Chronologically, the pattern of development for E7 is not similar to that 
of E3 and E4. The broad distribution of VOT values observed at 34 weeks is 
divided half into the t-range, half into the d-range. The percentage in the 
t-range then slowly falls to 12% at 75 weeks. The percentage increases in 
following sessions, but at almost two years is only 30% compared to over 50% for 
E3 and E4. 

Although there are chronological differences between E7 and the other 
subjects, we may interpret the data for all three children from another point of 
view. In particular, by drawing on other developmental and physiological data, 
we will suggest that there is a single sequential pattern of development which 
describes the data for all three subjects. According to this interpretation, 
E7 lags in time behind E3 and E4. It was, in fact, the opinion of the 
experimenters that the overall language development of E7 lagged considerably 
behind that of E3 and E4. This includes further observations of E7 until she was 
2-1/2 years old, a time period extending beyond that of data collection. 

DISCUSSION 



In this section we will develop two hypotheses based on physiology which 
will be useful for interpreting the infant data. The hypotheses are that 
although it is inherently difficult for an infant to control the timing between 
stop release and the onset of vocal fold oscillation, an infant (learning 
American English) can produce a short voicing lag stop, like /d/, more easily 
than a long voicing lag stop, like /t/. 

At least three separate articulatory gestures with separate innervations 
are needed to produce a stop consonant; these are the articulations to permit 
stop closure and release, to isolate the nasal cavities at the velum, and to 
initiate vocal fold oscillation. Other articulatory gestures in the vocal tract 
may also be used by adults to produce stops. However, from the point of view of 
an infant learning to produce stops, it would appear that control at the point 
of articulation, the velum, and the larynx must necessarily come first. 

The authors agree with the position of Lisker and Abramson (1964, 1967, 1971) 
and Rothenberg (1968) that the contrastive differences in the voicing dimension 
of stops are primarily the result of differences in the timing of glottal 
articulation relative to supraglottal articulation. We propose that distinct 
physiological mechanisms underlie the production of stops within each of the 
three voice onset time categories and, further, that stops in the short voicing 
lag category are easier to produce than stops in the other two categories. 

First we will examine the hypothesis that the infant needs to learn only 
one type of apical articulatory gesture for the production of apical stops 
regardless of the VOT category. Studies of adults do not reveal any essential 
differences in effecting articulatory closure for stops differing with respect 
to voicing. In a palatography study by Fujii (1970) of the dynamic placement 
of the tongue against the palate, Japanese /d/ and /t/ were considered to belong 
to a single articulatory class (compared to other consonants), although there 
were small, consistent differences between them. In other studies of labial 
stops, [author illegible] (1965) and Fromkin (1966) investigated 
electromyographic (EMG) signals from the primary muscle of articulation for 
labials, the orbicularis oris, and found only insignificant differences in peak 
EMG strength for English /p/ and /b/. Lubker and Parris (1970:632), using 
simultaneous measurements of EMG and force of labial contact, found the labial 
gestures for /p/ and /b/ "essentially monotypic." Measurements of closure 
duration for American English /p/ and /b/ by Lubker and Parris (1970), and Dutch 
/p/ and /b/ by Slis (1970), found durations to be the same in initial position, 
varying from 100 to 150 msec depending on context. Although these data 
concerning articulatory closure are very incomplete, we feel justified in 
assuming that an infant could learn essentially one type of apical articulation 
and be able to produce stops in all VOT categories. 

The nasal cavities must be isolated from the rest of the vocal tract in 
order to create the intraoral pressure needed to produce a stop. Muscles 
attached to the velum and pharyngeal muscles act to close the velum against the 
pharyngeal wall. Many recent investigations have shown some differential 
activity in the velopharynx for stops belonging to different VOT categories (Berti 
and Hirose, 1972; Lubker et al., 1970). The relevance of these studies will be 
discussed presently. 

For stop consonants in initial position, the glottal articulation that 
must be effected is the adduction of the vocal folds from an open (rest) 
position to a closed, oscillatory position. Voice onset time measurements reflect 
the time at which the adduction of the vocal folds is achieved relative to the 
stop release. For apical stops in the short voicing lag category, VOT 
measurements range from 0 to +20 msec. Direct observation of the larynx by 
fiberoptic techniques (Lisker et al., 1970) confirms that the vocal folds have 
fully adducted and are oscillating at or very near the time of stop release. 
Thus, the articulatory gestures required to produce short voicing lag stops are 
velopharyngeal closure followed by the complete adduction of the vocal folds at 
the time of release of the supraglottal articulators, such that vocal fold 
oscillation begins within 20 msec of release. 

In order to initiate vocal fold oscillation, another factor must be 
considered. Oscillation of adducted vocal folds is the result of airflow through 
the glottis which in turn occurs when there is a sustained pressure drop across 
the glottis. When the vocal tract is unobstructed and the vocal folds are 
adducted, a wide range of transglottal pressure differentials and tensions in 
the vocal folds will result in some sort of vocal fold oscillation. However, 
when the vocal tract is obstructed, as during stop closure, and the vocal folds 
adducted, Rothenberg (1968:91) has argued that oscillation will not occur or be 
maintained unless special articulatory mechanisms are employed to sustain a 
transglottal pressure drop. These mechanisms may include active or passive 
enlargement of the supraglottal cavity, some nasal airflow, and heightened 
subglottal pressure. Thus, if the vocal folds are adducted at any time during 
apical closure and additional muscle gestures are not made, vocal fold 
oscillation will not begin until after the stop closure is released. 

That is to say, for an infant to successfully produce short lag apical 
stops in initial position, he may fully close the glottis any time during apical 
closure providing that the velopharyngeal closure merely isolates the nasal 
cavities. However, to produce voicing lead stops, the infant must complete 
glottal closure considerably before oral release and then initiate and sustain 
vocal fold oscillation by the addition of other articulatory mechanisms (suggested 
above). These might include velopharyngeal adjustments other than simple 
velopharyngeal closure. 

Stops with long voicing lag are produced with the glottis open at the 
time of release according to fiberoptic investigations (Lisker et al., 1970). 
For American English /t/, the onset of vocal oscillations in the Lisker and 
Abramson data in Figure 1 has a mean of +45 msec VOT. Lisker et al. (1970) show 
that the vocal folds become fully adducted a short period of time (approximately 
30 msec) after oscillation has begun. Kim (1970:111) and other researchers 
indicate that it takes about 100 to 120 msec to fully adduct the vocal folds 
from their initial open position. Considering these data, an infant will 
successfully produce a long voicing lag stop if he leaves the glottis open 
throughout apical closure and then initiates vocal fold adduction approximately at 
stop release, having maintained velopharyngeal closure throughout. We note that 
the gesture for velopharyngeal closure could be approximately the same for the 
infant to produce short and long voicing lag stops, but it is likely to be 
different and more complex for the voicing lead stops. 

The range along the voice onset time continuum which the different VOT 
categories cover is also of interest. For the long voicing lag stops, English 

[...] 

closure, and the onset of vocal fold oscillation must be concluded in certain 
specified orders within a very short period of time if a stop consonant is to 
be successfully produced. Research on nonhuman primates further suggests that 
this is not a simple task. Lieberman et al. (1970) state that stops have not 
been observed spontaneously in the vocalizations of primates, although primates, 
especially Rhesus monkeys and chimpanzees, are thought to have the gross 
physiological capability to produce stops. Furthermore, it has proven extremely 
difficult to train chimpanzees to produce any stops. Research with young deaf 
children has shown that even when some stops are produced by these children, 
they are not successful in learning to make stops in more than one voice onset 
time category (Stark, 1971), typically the short voicing lag category. Thus, 
we may expect infants to exhibit difficulty in learning to control the timing 
of the articulatory gestures of stop consonants. 

Tracing the development of stops from birth, there is evidence that 
neonatal humans do not produce stops in their vocalizations (Lieberman et al., 
1968). In our own observation, it is not until about 6 months of age that 
infants will produce enough stops to yield even a small number from a half-hour 
recording. This is about the age of E7's youngest data. E7 was recorded weekly 
from 29 weeks of age, but not until 34 weeks were more than five apical stops 
recorded in one session. E7's distribution at 34 weeks may be characterized as 
spanning a wide range of VOT, -120 to +60 msec, and as being approximately 
uniform over the interval. 

Data were not collected for E3 or E4 at these early ages, but another 
subject was recorded weekly at the even younger age of 26 to 31 weeks. 
This subject produced almost exclusively velar stops, and not very many in most 
sessions. At 30 weeks, however, twenty-three velar stops were analyzed and 
gave a distribution which was clearly uniform from -160 msec voicing lead to 
+160 msec voicing lag. 

If we allow a tentative generalization on the basis of these two subjects, 
the results suggest that the earliest attempts to produce stops give voice 
onset time measurements randomly distributed over a wide range along the VOT 
continuum. We note that although the distributions were similar, for one 
subject the stops were predominantly apical, while for the other subject they were 
velar. Such a distribution shows no specific patterning after the adult models 
of /d/ and /t/ (or /g/ and /k/ in the other subject), which raises several 
questions. First, is there reason to expect distributions of stops from their 
earliest occurrences in infant vocalizations to reflect the adult model? An 
affirmative answer might be inferred from data which indicate that infants may 
attend to the adult model of stops prior to producing any of their own. Some 
recent studies have shown that within the first few months of life infants 
discriminate sounds which are identified by English-speaking adults as different 
stop consonants. Specifically, Eimas et al. (1971) showed that infants of 4 
weeks and 4 months are able to sort synthetic stop consonants into categories 
similar to those of the adult phonemic model of voicing. Morse (1971) further 



Footnote: See Emily Hahn's (1971) review of several attempts to teach 
chimpanzees to talk. 

[...] available to the infant and because, without [...] (linguistic) 
information, no stops would be produced at all [...] the question of why no 
evidence of the adult model is present in the early distributions. 

The preceding discussion rules out [...] possibility that stops are produced 
spontaneously [...] distributed with respect to voice onset time [...]. [...] 
the inherent difficulty of timing in stop articulation is [...] enough that an 
infant's first attempts to produce stops [...]. This was one of the hypotheses 
presented in the physiological discussion. Thus, we reason that an infant's 
earliest attempts [...] are uniformly distributed along the VOT continuum 
because he is unable [...] between the muscle gestures at the [...] and the 
[...]. 

[...] productions, we propose that a later [...] stops in the short voicing 
lag category [...]. Two of [...], E3 and E4, already had this type of 
distribution when we [...] about 50 weeks of age. E7 did not produce a 
comparable distribution until 75 weeks, nearly 6 months later. It can be [...] 
makes steady progress from the broad distribution [...] to a concentration of 
stops in the short voicing lag category. [...] the data at 75 weeks using a 
separate recording [...] apical stops analyzed, 98% were in the short voicing lag 
category. 

[...] the short voicing lag category [...] in light of the physiological 
discussion. The infant initially [...] coordinated timing of gestures needed to 
[...] the [...] VOT categories. At some point, however, we assume [...] to 
produce stops which will match those of the adult model. [...] control over 
different articulations for apical stops in the d-range than for apical stops in 
the t-range. It appears that [...] control over these articulations 
simultaneously [...] acquires the easier one first, i.e., the short voicing lag 
stop. 

Following this stage, stops were added in the t-range, but without a mode 
characteristic of adult /t/. By this time small numbers of recognizable 
English words were present in the recordings. Therefore, to carry the 
investigation further, a small study concerning only words was carried out. 

E4 was the child recorded the longest so her data were examined for words 
beginning with /d/ or /t/. Words were accepted if two adults could easily 
identify them as English words beginning with "acceptable" infant productions 
of /d/ or /t/, i.e., not obviously in error with respect to voicing. The entire 
utterance in which the /d/ or /t/ word occurred was written down in its approxi- 
mate English equivalent (for example, "Doggie see fish") for presentation to 
other subjects. 

The earliest age for which we could identify a few /d/ and /t/ words was 
96 weeks. We also selected words for 111 weeks (2-1/4 years) and 125 weeks 
(2-1/2 years). For comparison, all of E4's utterances were repeated by four 
children and three male adults. The children's ages were 3-1/2 years (one child) 
and 4-1/2 years (three children). For the children, an experimenter or the 
mother read from the respective lists of E4's utterances which the child was 
asked to repeat correctly. The adults simply read the utterance lists. 
Spectrograms were made of all utterances, and the VOT distributions were made 
separately for the /d/ and /t/ words for each individual subject. This procedure 
was used in hope that effects of context would be controlled across subjects. 
Following analysis, the three adult distributions were essentially identical, 
so only one is reported on here. 

The data for all subjects are presented in Figures 8, 9, and 10 correspond- 
ing to the utterances for E4 at 96, 111, and 125 weeks of age, respectively. 

Consider the distributions for /d/ words. Since there were only two /d/ 
words at 96 weeks, results are mainly based on distributions at 111 and 125 
weeks. The /d/ distributions for all five children for both 111 and 125 weeks 
are the same; thus, remarks on the /d/ data refer in common to Figures 8, 9, and 
10. All children's distributions bear basic similarities to those of the adult, 
with small differences. Adult JD has a VOT range for /d/ of +10 to +40 msec, 
with only 10% of the /d/'s falling in the t-range. E4 has a VOT range of 0 to 
+40 msec, with 25% of the /d/'s falling in the t-range. The four older children 
have a range for /d/ of 0 to +100 msec, again with 25% of the /d/'s in the 
t-range. (No /d/'s with voicing lead occurred.) We thus conclude that 
distributions for /d/ are the same from the earliest word productions to at least 
4-1/2 years of age. The /d/ distributions are quite similar to that of the 
adult model, but children show considerably more error in producing /d/ words 
with VOT values in the t-range. 

For the /t/ words, there is a difference in the distribution for E4 at 96 
weeks compared to the two older distributions. E4's /t/ words at 96 weeks have 
a mean of +40 msec and a range of +20 to +60 msec. These values are 
significantly smaller than those of adult JD's /t/ distribution, which has a mean 
of [...] the d-range. [...] distribution at 96 weeks, in fact, lies midway 
between adult /d/ and /t/ [...]. Therefore, according to listening tests for 
[...] words) were judged [...] ambiguously categorized as /d/ or /t/ [...]. 

E4's /t/ distribution changes radically by 111 weeks of age and is by then 
similar to all other /t/ distributions for all the children (except E4 at 96 
weeks). The characteristic /t/ distribution has a wide range (0 to +290 msec 
[...]) but with a mean not significantly different from adult JD's mean of 
+65 msec for all /t/ distributions. The distributions have no apparent 
mode, with the possible exception of subject KI (4-1/2 years). For all children, 
very few of the /t/ words intrude into the d-range. The children's /t/ 
distributions contrast with that of adult JD, which has a narrower range of VOT 
values, +30 to +100 msec, and a mode at +60 msec. It is also of interest to note 
that E4's /t/ words actually account for the major portion of all apical stops 
collected at 96, 111, and 125 weeks that have VOT greater than +30 msec. 

Footnote: This study was reported in part at the 79th meeting of the 
Acoustical Society of America and in Preston and Port (1969). 

Fig. 8 

Stop distributions for each subject corresponding to the 
repetitions of the /d/ and /t/ words produced by E4 at 
96 weeks of age. 

Fig. 9 

Stop distributions for each subject corresponding to the 
repetitions of the /d/ and /t/ words produced by E4 at 
111 weeks of age. 

A number of conclusions can be drawn from these results. When children 
begin to use /t/ words, they distinguish their productions functionally from 
/d/ words along the dimension of voice onset time. However, VOT distributions 
for /t/ clearly deviate from the adult model. At the earliest age for which we 
could collect some /t/ words, E4's distribution lies ambiguously across the 
boundary between the adult /d/ and /t/ distributions, but with VOT values 
nonetheless clearly larger than those for the majority of adults' or children's 
/d/ words. We cannot offer an explanation as to why this distribution is so 
different from the other children's /t/ distributions. It would not appear to be 
an artifact. The word "toy," or "toys," occurs both at 96 and 111 weeks. For four 
utterances of "toy(s)" at 96 weeks, the range of VOT is +20 to +40 msec; for ten 
occurrences of "toy(s)" at 111 weeks, the range is +20 to +140 msec. Thus, at 
96 weeks of age E4 has learned one aspect of the adult model for /t/: that /t/ 
should be produced with VOT greater than +25 msec. But she has not learned to 
produce apical stops with a large enough delay in the onset of voicing that they 
would be unambiguously categorized by adults as /t/ on the basis of VOT alone. 
However, since our adult listeners judged that the /t/ words clearly began with 
/t/, it is possible that some other cue besides voice onset time was being 
effectively signaled at this age. 

By 111 weeks, however, a new pattern of /t/ production occurs which 
continues to be characteristic of E4's and of our other children's speech through 
[...]. [...] distribution characteristically has a wide range [...] and no 
mode. At this stage, two of the most important aspects of the adult VOT model 
have been acquired: the VOT values are greater than +25 msec, 
and the vast majority of the VOT values are large enough to be unambiguously 
categorized as /t/ by adult listeners. However, the adult /t/ VOT range and mode 
are still to be acquired. We have proposed an explanation of why the /t/ range 
is wide: the production of /t/ is a difficult and complex articulation, wherein 
the timing evidenced by the adult model must be finely controlled. What is 
surprising, however, is that at 4-1/2 years, two years after E4's 111-week 
distribution, no particular change in the /t/ distribution occurs. That is, we 
know that children will eventually control their /t/ productions within the limits 
of the adult model, but this control has not been achieved as late as 4-1/2 years. 
Jacqueline Sachs (personal communication) has some roughly comparable VOT data 
for /p/ which show that by 5 years of age, half of her six subjects have a 
mode and a more restricted range of VOT values. It appears, then, that 
acquisition of the adult model for /t/ does not occur until after 5 years of age. 

Summarizing, we have traced in some detail the development of apical stop 
consonants with respect to VOT from 1 to 2 years of age and have included 
supplementary data covering birth to 4-1/2 years of age. From the physiology
of stop production, two hypotheses were supported, i.e., that control over 
timing in stop articulation is inherently difficult and that English /d/ is 
easier to produce than /t/. A sequential pattern of the development of apical 
stops with respect to VOT is suggested incorporating these hypotheses. 

No stops are observed in neonatal vocalizations. When stops first appear
around 6 months of age, the VOT distribution has a wide range of randomly dis-
tributed values extending from voicing lead to long voicing lag. This indicates
an infant's inability to control timing between the supraglottal and glottal
articulatory gestures. Control over timing for the apical stop is achieved 
first for the short voicing lag category. Apical stops in the long voicing lag 
category are then gradually added. When words beginning with /d/ and /t/ are 
first observed (about 2 years of age for subject E4), the characteristics of
these distributions remain constant until at least 4-1/2 years. The distribu- 
tion for /d/ looks similar to the adult's distribution but with more errors into 
the t-range. The /t/ word distributions bear less resemblance to the adult's,
having a wide range of VOT values with no mode, although there are few errors 
into the d-range. 
ABSTRACT



The Discrimination of Speech and Nonspeech Stimuli in Early Infancy* 

Philip Allen Morse+

Haskins Laboratories, New Haven 

The discrimination of speech and nonspeech stimuli was investigated in
infants 40-54 days of age by means of a nonnutritive conjugate sucking
[illegible]. [illegible] groups of Ss were given repeated presentations of one
auditory stimulus and, upon habituating to it, were shifted to a second
(postshift) stimulus. For Group [illegible] the pre- and postshift stimuli
differed according to place of articulation ([ba-] vs. [ga-]). Group [illegible]
received stimuli consisting of a difference in intonation ([illegible]
intonation). Group C (Control) was presented with [illegible] stimulus during
preshift and postshift ([ba-]). For Groups NS (Nonspeech control) the pre- and
postshift stimuli consisted of the isolated acoustic cues which differentiate
the place stimuli [ba] and [ga]. Changes in hi-amplitude sucking revealed that
infants 40-54 days of age [illegible] place of articulation and intonation.
[illegible] on [illegible] nonspeech control conditions suggested that infants
[illegible] in a linguistically relevant [illegible].
ABSTRACT



The Effect of Delayed Channel on the Perception of Dichotically Presented 
Speech and Nonspeech Sounds* 

Robert John Porter, Jr.+
Haskins Laboratories, New Haven 

Previous investigations have found that subjects identify the temporally 
lagging member of a pair of dichotically asynchronous stop consonant-vowel
syllables with greater accuracy than the leading member. This advantage for 
the lagging syllable has been termed the "lag effect." Three studies are 
reported which examined the possibility that this effect is a manifestation of 
the special processes required for the perception of the acoustically encoded 
stop consonants rather than a general effect to be found for several types of 
acoustic events. 

In the first two experiments, the effects observed for the syllables 
[bae, dae, gae] were compared to those obtained with nonspeech sounds which were 
acoustically comparable to the syllables but had been previously shown not to 
require the same special perceptual processing. Two types of nonspeech sounds
were used: (1) the acoustically isolated second formants of the syllables 
("bleats"), and (2) acoustically isolated second-formant transitions ("chirps").
Three groups of subjects received dichotically asynchronous pairs of syllables,
bleats, or chirps. Twelve stimulus-onset asynchronies, from 0 to 165 msec,
were used. 

If the lag effect is a general phenomenon of dichotic listening, then the
nonspeech sounds would be expected to display lag effects similar to those observed
for syllables. This did not appear to be the case. Whereas large and reliable 
lag effects were found for the syllables at asynchronies less than 120 msec 
(maximal at 60 msec), the lag advantages of the nonspeech controls were very 
small and variable. The chirps, in some cases, even displayed lead advantages. 

The results for the nonspeech signals were interpreted in terms of dichotic
masking effects such as are observed in nonspeech auditory masking studies. 
The considerably larger and more reliable lag effects for the stop-vowel syl- 
lables were seen as indicating that the perceptual processing of these signals 
is particularly sensitive to the conditions of dichotic asynchronous competi- 
tion. It was argued that this peculiar sensitivity was a manifestation of the 
special "speech mode" processing known to be required for the perception of the 
highly acoustically encoded stop consonants. 



*Dissertation submitted in partial fulfillment of the requirements for the
degree of Doctor of Philosophy, University of Connecticut. 

+Currently Kresge Hearing Research Laboratory, Louisiana State University,
New Orleans. 



In order to examine further the suggested relationship between the lag 
effect and perception in the speech mode, a third experiment compared the 
results obtained with stops (in syllables [ba, da, ga]) to those obtained for 
a liquid and two semi-vowels (in syllables [la], [wa], [ja]). The liquids
and semi-vowels tend to be acoustically less encoded than the stops and, pre- 
sumably, require special processing to a lesser degree. Previous studies had 
demonstrated that steady-state vowels, which can be shown to be less encoded 
than liquids and semi-vowels, tend not to yield lag effects. If the lag 
effect is a consequence of the involvement of special decoding processes, the 
liquids and semi-vowels would be expected to display lag effects to a lesser 
degree than stops and to a greater degree than vowels. 

The procedures and asynchronies used were the same as for the first two 
experiments. Eleven subjects received both the stop and the liquid and semi- 
vowel dichotic tests. 

The results for the liquid and semi-vowels were consistent with expectation. 
Five subjects displayed lag effects similar to those they displayed for stops. 
The results for the six remaining subjects were similar in several respects to 
those which had been previously observed for chirps and vowels. Apparently, 
these "intermediately" encoded speech sounds may in some circumstances be per- 
ceived like the stops and in other circumstances like vowels and nonspeech. 

Taken together, the results of all three experiments suggest that the lag 
effect is not a general phenomenon of dichotic listening but is specifically
associated with the perception of encoded speech sounds. As such, the effect 
is a possibly valuable source of information concerning the character of these 
special decoding processes. 
ABSTRACT



Phonetic Coding of Kanji 

Donna Erickson, Ignatius G. Mattingly, and Michael Turvey

An experiment in the short-term recall of visually presented Japanese Kanji 
ideograms suggests that Kanji may, like alphabetic words, be encoded phonetical- 
ly, despite their lack of phonetic structure. The experiment, based on Kintsch 
and Buschke's (1969) paradigm, assumed that similarity of items in a list 
increased errors in recall. Four lists were prepared, each containing sixteen 
different Kanji. The first included phonetically similar pairs of characters; 
the second, semantically similar pairs; the third, visually similar pairs; the 
fourth was a control list containing no similar pairs. The subjects, ten native 
speakers of Japanese, were presented with randomly ordered versions of each 
list, at one character per second. After a subject had seen an entire list, he 
was presented with a cue character selected from the list and asked to recall 
the character which had been presented immediately before the cue. Confusion in 
primary memory was significantly greater for the phonetic list than for the 
other lists. These results strengthen the hypothesis that regardless of struc-
ture, visually presented linguistic items are, like speech itself, phonetically
processed. 